Points in Focus Photography

The Lightroom Catalog and Develop History States

Over the past couple of months I’ve been writing a lot about the progress of Lightroom over the 8 years it’s been around, and over that time I’ve been getting into increasingly technical details. This time I want to talk — and to be completely honest rant a little — about one very technical point of the Lightroom catalog and how Adobe stores data.

Let me also be up front about this: I don’t, nor have I ever, worked for Adobe. I don’t know precisely why they made the decisions they made, and I’d like to believe they had good performance-driven reasons for doing what they did. At the same time, it’s entirely possible they did it because it was faster to write or saved them a couple of lines of code, and they didn’t anticipate that people would end up with heavily fragmented multi-GB catalogs, or they assumed any problems would be masked by better hardware down the road.

In a previous post I lamented how the unbounded growth of the preview cache folder can be problematic when working on systems with relatively small storage limits or where you intend to push the storage to its limits. Inefficiency and unbounded growth seem to be the rule rather than the exception when it comes to the inner workings of Lightroom. This time I’m going to talk a little about the develop history mechanism, specifically how Adobe stores the data, and how that data grows and expands the catalog size tremendously.

The Premise of Lightroom’s Develop History

In my last post I touched on how Lightroom stores the develop settings for an image as structured data in the catalog rather than altering the pixels in the source image. Space-efficient virtual copies are one extension of this mechanism. The second is the ability to save an undo history of all develop steps, to an extent and with a persistence that’s not really possible in a bitmap editor.

Where a bitmap editor, like Photoshop, has to store every pixel value that’s changed to be able to undo those changes, the vast majority of Lightroom’s adjustments can be stored as a handful of bytes. This enables Lightroom to be relatively aggressive in creating history states, and in persisting them for the life of the image in the catalog.

All told, between editing an image and exporting or publishing it, it’s not uncommon to have many saved history states for any given image. In my library I average around 15 states for every image; however, because I don’t process every one of my images, that actually means I have a lot of history states for comparatively few images.

Investigating History States

Unlike the rest of this Lightroom series, this post actually started as a specific investigation into a problem.

One of my friends has been using Lightroom almost as long as I have. Unlike me, though, he shoots a lot more than I do and his catalog is much larger than mine. One day he called me up and asked me about a strange error he was getting when he started Lightroom; something about not being able to connect to something. If he let Lightroom load for a while, it would go away and Lightroom would work normally. Moreover, the error wouldn’t always appear on subsequent restarts.

Not having seen the actual error, my first guess was that it was maybe something to do with Lightroom trying to connect to Adobe’s servers. The next time I was at his place, I took a look at it and started digging into what could be causing it. I quickly determined that the error had nothing to do with internet connectivity; it was something internal to Lightroom. For some reason my mind immediately jumped to the SQLite subsystem that is the core of the Lightroom catalog file.

The first place I looked was his catalog file; it was approaching 4 GB, and a quick look with a defragmenting tool showed that it was heavily fragmented.

While fragmentation has become much less of a concern with modern filesystems, like NTFS, it can still be a problem if a file is extremely fragmented. In this case, there was a 4 GB file that didn’t have a fragment bigger than a few tens of MB. That level of fragmentation, especially paired with spinning disks, created a significant decrease in disk performance and therefore increased loading time — which was ultimately what was causing the error about the server not responding.

I did a poor man’s defragmentation on the file by moving it to another drive and back. As an aside, it’s necessary to move the file off the volume entirely (or copy it and remove the original) in order to get the OS to write the data back out in as few fragments as possible — though be aware this method will not always work optimally.

That seemed to fix his problem, at least temporarily. But it also got me looking at the Lightroom catalog file.

He has quite a few more images than I do, but at the time he didn’t have 4 times as many, yet his catalog was 4 times larger than mine. At the same time, I have titles and captions on a lot of my images, and an extensive keyword library. While I wouldn’t expect the keywords to take up too much space, the titles and captions are all unique and each runs several hundred bytes. None of that explains why his catalog was disproportionately bigger than mine with less metadata involved. This suggested to me that there might be something up with what Adobe is storing in the catalog.

This is where that core open source technology SQLite comes in. Since the catalog isn’t a proprietary format, it’s possible to examine it (or potentially repair it) with readily available tools.

Step one: dump the table sizes

My first plan of attack was to look at how big the tables were to see if there were any obvious problem spots.

SQLite provides a utility, sqlite3_analyzer, that will generate a whole slew of statistics for an SQLite file. These include table sizes, utilization, storage efficiency, and so forth.

*** Page counts for all tables with their indices *****************************

ADOBE_LIBRARYIMAGEDEVELOPHISTORYSTEP.............. 126636      50.5% 
ADOBE_ADDITIONALMETADATA.......................... 67200       26.8% 
ADOBE_IMAGEDEVELOPSETTINGS........................ 25920       10.3% 
ADOBE_IMAGEDEVELOPBEFORESETTINGS.................. 7420         3.0% 
AGMETADATASEARCHINDEX............................. 4136         1.6% 
ADOBE_IMAGES...................................... 3767         1.5% 
AGLIBRARYFILE..................................... 3356         1.3%

 

If the table containing the develop settings was consuming 10.3%, why would the develop history table be 50.5% of my catalog file?

Sure, there should be more history states than current develop settings, but 5 times as much data stored? In terms of bytes, that’s more than 400 MB of history data.

In any event, shouldn’t Adobe be storing develop settings as efficiently as possible?

So what does the history state table look like?

Running .schema Adobe_libraryImageDevelopHistoryStep in the sqlite3 client returns the following.

CREATE TABLE Adobe_libraryImageDevelopHistoryStep (
 id_local INTEGER PRIMARY KEY,
 id_global UNIQUE NOT NULL,
 dateCreated,
 digest,
 hasDevelopAdjustments,
 image INTEGER,
 name,
 relValueString,
 text,
 valueString
);
CREATE INDEX index_Adobe_libraryImageDevelopHistoryStep_imageDateCreated ON 
   Adobe_libraryImageDevelopHistoryStep( image, dateCreated );

 

That’s certainly not what I expected. What I expected was to see a whole slew of columns, one for each develop setting that needs to be stored. Maybe this was a naive view on my part.

Okay, let’s pull a row from the table and see what’s actually being stored.

select * from Adobe_libraryImageDevelopHistoryStep limit 1;
id_local|id_global|dateCreated|digest|hasDevelopAdjustments|image|name|
  relValueString|text|valueString
928|21A0EDF0-3FF9-4503-B1BB-986330914768|465813406.266058|
b05afdbad359c8337b9bb6e663ca8aec|-1.0|916|
Import (10/6/15 04:36:46)||s = { AutoGrayscaleMix = true,
AutoLateralCA = 0,
AutoWhiteVersion = 134348800,
Blacks2012 = 0,
Brightness = 50,
CameraProfile = "Adobe Standard",
CameraProfileDigest = "BA45C872F6A5D11497D00CBA08D5783F",
Clarity2012 = 0,
ColorNoiseReduction = 25,
Contrast = 25,
Contrast2012 = 0,
ConvertToGrayscale = false,
DefringeGreenAmount = 0,
DefringeGreenHueHi = 60,
DefringeGreenHueLo = 40,
DefringePurpleAmount = 0,
DefringePurpleHueHi = 70,
DefringePurpleHueLo = 30,
Exposure = 0,
Exposure2012 = 0,
GrainSize = 25,
Highlights2012 = 0,
LensManualDistortionAmount = 0,
LensProfileEnable = 1,
LensProfileSetup = "LensDefaults",
LuminanceSmoothing = 10,
PerspectiveHorizontal = 0,
PerspectiveRotate = 0,
PerspectiveScale = 100,
PerspectiveVertical = 0,
ProcessVersion = "6.7",
RedEyeInfo = {  },
RetouchInfo = {  },
Shadows = 5,
Shadows2012 = 0,
SharpenDetail = 30,
SharpenEdgeMasking = 0,
SharpenRadius = 1,
Sharpness = 50,
ToneCurve = { 0,
0,
32,
22,
64,
56,
128,
128,
192,
196,
255,
255 },
ToneCurveBlue = { 0,
0,
255,
255 },
ToneCurveGreen = { 0,
0,
255,
255 },
ToneCurveName = "Medium Contrast",
ToneCurveName2012 = "Linear",
ToneCurvePV2012 = { 0,
0,
255,
255 },
ToneCurvePV2012Blue = { 0,
0,
255,
255 },
ToneCurvePV2012Green = { 0,
0,
255,
255 },
ToneCurvePV2012Red = { 0,
0,
255,
255 },
ToneCurveRed = { 0,
0,
255,
255 },
Version = "9.2",
WhiteBalance = "As Shot",
Whites2012 = 0 }

 

First reaction: WTF?

A Quick Primer on Datatypes and Storage

In a computer, all data is stored in a binary format — this is the whole ones and zeros thing. However, the meaning and arrangement of those 1s and 0s, and therefore what they ultimately represent, varies with the data’s type.

Broadly speaking, there are 3 types of data that computers deal with. Integer types store whole numbers (e.g. -5, 0, 1, 1000), and do so efficiently and in a format that virtually all CPUs can process natively. Floating point numbers store a representation of decimal or fractional data (e.g. 35.4 or -10,005.35); like integers, most floating point numbers are stored in a standard format that can also be processed natively by CPUs. Finally, strings store the representation of text as a series of integers that correspond to characters in a table.

Each of those types offers various pros and cons.

For example, an 8-bit/1-byte integer can store 2^8, or 256, values. If the type is signed, meaning it can represent negative and positive numbers, those values range from -128 to 127. If the type is unsigned, it can store values from 0 to 255. Integers are always exactly what they represent, and the math is done in an exactly predictable way.

Floating point numbers are much more complex, and I’m not going to get into the details of them, but if you’re interested Wikipedia has a reasonable article about floating point numbers. Floats can represent huge ranges of numbers, but they are approximations due to the binary conversion, and these errors lead to some small levels of imprecision in calculations.

The final type I called text. Text is stored as a sequence of integers, but unlike integer values, the integers storing text represent a character not a number directly. Unlike integers or floats, a number stored in text is not something the computer can directly access and manipulate in hardware, it has to have some software to tell it how to translate the human mapped meanings into something it can process. Moreover, the amount of storage required to store a number in text is going to depend on the number of digits in the number. And this is the critical point with respect to what’s going on here in Lightroom.

For example, consider how one might store a value that ranges from 0 to 100 (as many of Lightroom’s sliders do). To store this as an integer, only 101 values are needed, which is easily covered by the 256 possible values available from a single-byte integer (1 byte = 8 bits = 2^8 = 256 options). On the other hand, if this is stored as text, it could use 1, 2, or 3 characters, and since each character requires at least 1 byte of storage, it could take as much as 3 bytes of memory to store the value.

In binary, an unsigned integer representing a value of 50 would read as 00110010. However, the textual representation ‘50’ would be two characters 5 and 0, which translate to ASCII values of 53 and 48, which in binary would be 00110101 00110000.
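
To make that concrete, here’s a trivial Python illustration (not anything from Lightroom, just a demonstration of the sizes involved):

import struct

value = 50

as_int = struct.pack("B", value)          # one unsigned byte: 0b00110010
as_text = str(value).encode("ascii")      # the characters '5' and '0': 0x35 0x30

print(len(as_int), format(as_int[0], "08b"))               # 1 00110010
print(len(as_text), [format(b, "08b") for b in as_text])   # 2 ['00110101', '00110000']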

Now consider adding 5 to that 50. If the 50 is in binary, the computer can just add 5 (00000101) to the existing 00110010, using basically the same rules you learned in elementary school, except you carry when a column sums to 2 instead of 10. The processor gets 00110111, which is 55 in binary.

On the other hand, if you were trying to do this with the string representation, first some code would have to be called that understands how to convert the two characters into a computer-usable number (and the 5 as well, if that’s a string too). Then it would have to do the same math as the native computer-usable type. Then, if you wanted the data back as a string, it would have to convert the 55 back into two characters.
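
In code, the difference looks roughly like this (again, purely illustrative):

# Native integer math: the value is already in a form the CPU can add directly.
a = 50
result = a + 5                   # 55

# Text math: parse the characters, do the addition, then format back to text.
s = "50"
result_text = str(int(s) + 5)    # "55"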

Which brings up a second aspect. Going from text to an integer — which is what the computer fundamentally requires to process data — requires more processing than if the computer already has the numbers in a format it can deal with.

Serialized Data in a Database

Before I started looking at the library in SQLite, I had a mental model that each of the develop settings would be stored in its optimal form as an integer, float, or string, leveraging the database’s ability to relate values to fields to ascribe their meaning.

SQLite, like most SQL databases, has no significant problem doing this. Each column is assigned a logical meaning, via a name, and the data stored in that column is understood to be that information implicitly by the programmers using it. Keep in mind, all of this is logical and for the benefit of the programmer; the name need not be meaningful, but it sure makes life a lot easier if it is.

Admittedly there are limits, but by default SQLite supports tables with up to 2,000 columns, and it can be configured to support up to 32,767 columns, though that’s not recommended. Lightroom currently has about 60–65 develop settings that need to be kept track of, which is well below the 2,000-column default, let alone the maximum limit.

Instead, Adobe is doing something different.

What Adobe is doing is serializing a Lua object and storing the resulting string in the database. Serialization does have some advantages, the primary one being that it allows for simpler interoperability.

Using serialization isn’t entirely unreasonable. One of the primary functions of serialization is to convert an object in the computer’s memory into a format that can be stored on disk without having to worry about specific formatting and implementation details. This is admittedly what Adobe is doing: saving the state of some object in Lightroom’s memory into the database.

However, serialization has limitations. One of those is that the serialized string form of an object will generally take up more space than the object did in memory. Some of that is the inherent inefficiency of converting native binary types into string representations of their values.

A second source of inefficiency is the addition of the characters that are needed to delineate and organize the representation of data. These are the curly braces, equals signs, and commas.

The Lua serializer that Adobe is using goes a step further and adds human-readable flourishes to the data in the form of spaces and new lines. Fields are already separated by commas, and the field name and value are delineated by equals signs. There’s precious little reason to further pretty up the formatting for human readability. In fact, removing the spaces and new line characters reduces the size of the string by about 20%.
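
As a rough sketch of what I mean, stripping that formatting, while leaving quoted values like "Adobe Standard" untouched, only takes a few lines of Python (the function is mine, not anything Adobe ships):

import re

def strip_formatting(serialized: str) -> str:
    """Remove whitespace from a serialized Lua table, except inside quoted strings."""
    # Split out double-quoted segments so their contents are left untouched.
    parts = re.split(r'("(?:[^"\\]|\\.)*")', serialized)
    return "".join(p if p.startswith('"') else re.sub(r"\s+", "", p) for p in parts)

# e.g. savings = 1 - len(strip_formatting(value_string)) / len(value_string)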

Probably the biggest source of inefficiency though comes from having to define labels so the software reading the serialized data knows what it’s reading. Those are all the human readable labels in the earlier sample.

In a database, those labels are defined essentially once, and every time you store another row of data, the storage efficiency of those labels improves. With serialized data, on the other hand, they have to be repeated every time you generate the serialized representation. Instead of a situation where the more you store, the less relative waste you have, the waste increases at the same rate as the storage does.

Database Efficiency

I wanted to test the theory that the database would be considerably more efficient if the individual settings were broken out into their own columns.

SQLite is a little different from most SQL database engines in how it stores data. Each column doesn’t need to have a size and type parameter specified for it — though you can as a hint. SQLite uses dynamic typing to store data in what it determines is the most efficient manner.

For example, integer types are stored using 1, 2, 3, 4, 6, or 8 bytes, depending on the value that’s sent to the SQLite engine. Store 50 and SQLite will use 1 byte. Store 200,000 and it’ll use 3 bytes.

I threw together a test program in Python to create two databases. One was loosely modeled after the table in Lightroom, though simplified to only have an index and a string column for the serialized text. The second database used a separate column for each of the 61 settings stored in the test develop settings. I then inserted the data 10,000 times into each of the two tables.
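
The test boiled down to something like the following sketch; the table and column names here are placeholders of my own, not Lightroom’s actual schema:

import sqlite3

NUM_SETTINGS = 61
COLUMNS = [f"setting_{i:02d}" for i in range(NUM_SETTINGS)]   # placeholder column names

def build_serialized_db(path, serialized_text, rows=10_000):
    """Mimics the Lightroom approach: an id plus one big serialized text column."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE history (id INTEGER PRIMARY KEY, valueString TEXT)")
    con.executemany("INSERT INTO history (valueString) VALUES (?)",
                    ((serialized_text,) for _ in range(rows)))
    con.commit()
    con.close()

def build_columnar_db(path, settings, rows=10_000):
    """One column per develop setting, letting SQLite store each value natively."""
    assert len(settings) == NUM_SETTINGS
    cols = ", ".join(COLUMNS)
    marks = ", ".join("?" for _ in COLUMNS)
    con = sqlite3.connect(path)
    con.execute(f"CREATE TABLE history (id INTEGER PRIMARY KEY, {cols})")
    con.executemany(f"INSERT INTO history ({cols}) VALUES ({marks})",
                    (tuple(settings) for _ in range(rows)))
    con.commit()
    con.close()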

Not surprisingly, the serialized string test database was much bigger than the columnar database; 20 MB versus 5 MB.

In my opinion, a 75% smaller catalog is not something to sneeze at. For my more-than-40,000-image catalog, the reduction in the history steps table alone would shave 365 MB off my catalog file while retaining all the history. For my friend’s 4 GB catalog, of which 60% is history steps, reducing the storage requirements by 75% would shave 1.8 GB of disk space off his catalog.

Alternatively, I could delete all the image history in my catalog and free up about 450 MB. But in doing so I lose all that history information.

And keep in mind, the problem here isn’t that the catalog is big and disk space is in demand; it’s that the catalog has to be loaded into memory to be processed, and the larger the catalog file is, and especially the more fragmented it is, the longer that takes, and then you potentially get the problem I talked about at the start of this post.

As a secondary note, I was also curious about the performance of the serialized versus the columnar tables. As such, I timed 10,000 insertions into each to see what kind of performance I was getting[1].
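
For a test like this, the timing doesn’t need to be anything more sophisticated than a wall-clock timer around the insert calls; roughly:

import time

def time_inserts(build_fn, *args, rows=10_000):
    """Time one of the build functions above; returns average seconds per row."""
    start = time.perf_counter()
    build_fn(*args, rows=rows)
    return (time.perf_counter() - start) / rows

# e.g. time_inserts(build_serialized_db, "serialized_test.db", sample_serialized_text)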

At least with Python’s implementation for SQLite, there was functionally no difference between a single big insert as a serialized string, and an insert with 61 separate parameters. The minor variations I did see, on the order of 1 microsecond, are within the realm of experimental error.

I should also point out the history step table isn’t the only place where Adobe serializes develop setting data into the catalog. The Adobe_ImageDevelopSettings table uses the same structure, and that accounts for a further 10% of my catalog size as well.

Storing Positional Data

There’s a second aspect to the history and develop data that I haven’t touched on, but it’s also a major contributing factor to the storage consumed by history steps and to overall database efficiency. There are a number of places in Lightroom where the engine has to store x,y positional information.

One example, seen in the same data earlier, is the tone curve. X and Y values are stored as alternating integers from 0 to 255. A straight linear mapping is stored as {0,0,255,255}, and a curved mapping would be stored with more points, such as {0,0,64,60,128,140,255,255}.
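
Read in pairs, that flat list is just a set of (input, output) points; a trivial helper (mine, purely for illustration) shows the interpretation:

def tone_curve_points(flat):
    """Interpret Lightroom's flat tone-curve list as (input, output) pairs."""
    return list(zip(flat[0::2], flat[1::2]))

tone_curve_points([0, 0, 64, 60, 128, 140, 255, 255])
# -> [(0, 0), (64, 60), (128, 140), (255, 255)]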

Beyond the curve, there are several tools that store image-positional information: the spot healing tool, the gradient tools, and the local adjustment brush. These all store x and y coordinates at regular intervals, along with the brush size and a definition of the affected areas.

PaintBasedCorrections = { { CorrectionActive = true,
    CorrectionAmount = 1,
    CorrectionID = "CFD53BB4-F91E-4616-BFDE-ECE323554311",
    CorrectionMasks = { 
        { CenterWeight = 0,
            Dabs = { "d 0.846878 0.382692",
                "d 0.821662 0.399986",
                "d 0.795618 0.408876",
                "d 0.769187 0.411538",
                "d 0.743287 0.422603",
                "d 0.717651 0.436220",
                "d 0.692825 0.456801",
                "d 0.666778 0.461538",
                "d 0.640369 0.465035",
                "d 0.614742 0.477943",
                "d 0.593083 0.511154",
                "d 0.568074 0.528500",
                "d 0.548332 0.559867" },
            Flow = 1,
            MaskID = "39B9AC8F-BB01-4241-B67B-26AB767B356B",
            MaskValue = 1,
            Radius = 0.088226,
            What = "Mask/Paint" 
    },
    { CenterWeight = 0,
        Dabs = { "d 0.765612 0.659615",
            "d 0.791546 0.647717",
            "d 0.817480 0.635819",
            "d 0.843414 0.623920" },
    Flow = 1,
…

Note: I’ve indented the above excerpt to better show the structure; the spaces/tabs are not stored in the catalog.

As an aside, Lightroom’s painting is image-size agnostic; the coordinates are saved as decimal percentages (between 0 and 1) with 6 decimal places of precision. This is especially handy as it enables Lightroom to use the same position data at any export resolution without multiple transformations.
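
To make the resolution independence concrete, mapping a stored dab back onto a particular image size is just a multiplication; for example (illustrative helper of mine, using a hypothetical 6000×4000 pixel image):

def dab_to_pixels(x_norm, y_norm, width, height):
    """Convert a normalized (0-1) dab coordinate to a pixel position for a given image size."""
    return round(x_norm * width), round(y_norm * height)

dab_to_pixels(0.846878, 0.382692, 6000, 4000)   # -> (5081, 1531)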

Using 6 places of precision also shouldn’t be a problem any time soon, as it allows 1-pixel accuracy in images where the long edge is up to 100,000 pixels wide. For a 3:2 aspect ratio image, that would be a 6.66-gigapixel image. For panos with high aspect ratios, a 100,000-pixel width comes at substantially less overall resolution, such that the limit might be approachable now or in the near future. However, 1-pixel accuracy is probably not necessary for most masks anyway, given the coarseness of the brushes and the size of the images where it starts breaking down.

But back to the point about storage. These coordinate driven tools create significantly more data than the regular Lightroom adjustments do. Moreover, because Lightroom’s history steps are a complete snapshot of all settings at the time the history step was created, these large coordinate sets will propagate forward with subsequent history steps.

In other words, if you do any extensive work with the spot adjustment brush, spot healing tool, or the gradient tools (especially with the new masking capabilities), you can rapidly generate large history states that get stored in the database. In my database, I have a number of images with more than 10 MB of history data as a result of this behavior.

My recommendation is that, whenever possible, spot healing, local adjustments, and any gradient masking should be kept to the end of the develop process to minimize the impact they have on ballooning catalog sizes through their propagation in the history table.

Conclusions

I’m hesitant to blast Adobe too hard for doing what they did. I’m not a Lua expert, and I don’t have access to the Lightroom source code (I’m even speculating that it’s mostly written in Lua). There may be very good reasons why Adobe has elected to store develop settings as serialized Lua.

Thinking on that, about the only viable angle I see is that by storing big text strings Adobe can add develop capabilities without having to update the catalog’s structure with a new column for each setting. When Adobe added the dehaze capability, they only had to store that value in the string, and didn’t have to update the catalog’s tables to support the new feature.

At the same time, major Lightroom revisions have seen changes to some structures in the catalog that required the catalog file to be updated. Given the low computational bar for adding an additional column to a table in SQLite — it wouldn’t even require doing something expensive like copying the table data — I’m not sure that should be a significant consideration.
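
For what it’s worth, adding a column in SQLite is a one-statement schema change that doesn’t rewrite the existing rows; against the placeholder test table from my earlier sketch it would look like this:

import sqlite3

con = sqlite3.connect("columnar_test.db")   # the placeholder columnar test database from earlier
# ADD COLUMN with a constant default only updates the schema record;
# existing rows are not rewritten on disk.
con.execute("ALTER TABLE history ADD COLUMN Dehaze REAL DEFAULT 0")
con.commit()
con.close()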

Certainly one could argue that with big modern disks, and certainly modern SSDs, the size of the Lightroom catalog isn’t that big of a deal. And that’s probably true; saving a couple hundred MB, or 1.8 GB in my friend’s case, isn’t that big of a deal when you have a 4 TB disk to store everything on.

On the other hand, as a programmer, the whole situation just looks messy to me on a number of levels. There’s a lot of extra data added to the serialized output that shouldn’t really be needed for it to be deserialized. Just removing spaces and new line characters reduced the size of the serialized test block by around 20%.

Further storage savings could be had by normalizing the coordinate-based tool settings, even if they remained stored as serialized objects. This would allow much smaller index-based references to be propagated from state to state instead of potentially multiple KB of repeated text entries.
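
As a purely hypothetical sketch of what that normalization might look like (none of this exists in the actual catalog), the mask data could be stored once and referenced by id from each history step:

# Hypothetical normalized layout: store each brush mask once, and have history
# steps reference it by id instead of re-serializing the full dab list each time.
NORMALIZED_MASK_SCHEMA = """
CREATE TABLE mask (
    id     INTEGER PRIMARY KEY,
    tool   TEXT,    -- e.g. 'Mask/Paint'
    radius REAL,
    flow   REAL,
    dabs   BLOB     -- packed x,y coordinate pairs
);
CREATE TABLE history_step_mask (
    history_step INTEGER,                    -- the referencing history step
    mask         INTEGER REFERENCES mask(id)
);
"""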

Furthermore, at least based on my cursory observations of Lightroom’s catalog design, there are a number of places where develop settings are stored redundantly. The Adobe_imageDevelopSettings table stores the current develop settings (in serialized form) and a reference to an entry in the Adobe_libraryImageDevelopHistoryStep table, which also stores the same serialized data.

There is also an Adobe_imageDevelopBeforeSettings table that appears to store yet more serialized develop settings, presumably the import defaults that get applied to the image. However, the table is yet again filled with lots of large serialized objects stored as strings. And honestly, I’m not even sure why this table exists.

Given that all three of these tables reference the same structure of data, and in the case of the develop settings and history tables mirror identical datasets, there could be a whole lot of space savings from properly normalizing develop settings into their own table for all states, history and otherwise.

Bear in mind, again, that between the history table, the develop settings table, and the before-settings table, that accounts for nearly 64% of the data in the Lightroom catalog. Even just normalizing the develop settings and history steps should free up about 10% of the space my catalog uses storing redundant data.

Admittedly, the catalog is nowhere near the limits imposed by being an SQLite database. And there should be no added risk of data loss or corruption from having a large catalog file. The only negatives that should occur stem from slower random disk access and I/O loads, especially on spinning disks, and even more so when the catalog is stored at the “ends” of spinning disks. Keep your catalog file on a fast SSD, and the size and fragmentation of the catalog shouldn’t become an issue until your catalog file is absolutely huge.


  1. Tested on an Intel Xeon E3–1220v2, CPU governor set to power save, 8 GB of RAM, 2 TB 5400 RPM mirrored ZFS pool with 8 GB of SSD L2ARC.  ↩

Comments

Matt O’Brien

Challenge. Is this possible.

I would really like to be able to have a means of calculating how much space is occupied by an image within a Lightroom catalog. This number does not need to be precise.

The reason.

I suspect that a relatively small proportion of the 100,000 images in my Lightroom catalog occupy a lot more space than I think or am comfortable with. It is also possible that most of these (rogue) images are more than a year old and are ones which I no longer actively use. Also, I may have been too enthusiastic in my use of brush and filter tools in the early days of Lightroom, so have a lot of bloat which I would like to be able to manage.

If I could generate a listing of images sorted in descending sequence of size consumed in the catalog, for say the top 500 images, then I could actively manage how I want to deal with them. I may simply convert these to TIFFs, delete all related history, etc…. Or archive these to a different catalog. Although I am keen to maintain a single catalog, I place a higher value on having an optimised catalog.

Some tables in the catalog have a one-to-one relationship with an image. It is possible to calculate the number of bytes such a record occupies. Accumulate this number from the top ten tables in the catalog.

I would be interested in your views.

Also, I always wondered if the brush strokes were recorded as pixels (ie blobs in a database) or as a series of algorithmic numbers representing the path of the brush and its related parameters.

    Jason Franke  | admin

    Challenge. Is this possible.

    I would really like to be able to have a means of calculating how much space is occupied by an image within a Lightroom catalog. This number does not need to be precise.

    It should be relatively simple to get a close enough value to be useful. It won’t be byte-perfect, but it would definitely point towards the images with the most data in the table.

    I’d have to sit down and think about the SQL, and you’d want to use an external scripting engine like Python to put everything together into a neat package, but it should be doable. No promises that I’ll do it, even though it would be interesting.

    Also worth noting, since you use XMP files in addition to the catalog, you could just search for the biggest XMP files in your library, and there’s a good chance those will correspond to the files with the biggest develop histories in the database. Not perfectly, mind you, but it should be a good 0th-order approximation.

    I may simply convert these to TIFFs, delete all related history, etc…. Or archive these to a different catalog. Although I am keen to maintain a single catalog, I place a higher value on having an optimised catalog.

    There’s virtually zero advantage to converting to TIFF; in fact, it would make the net storage situation worse. Once you delete the history for the image from the catalog there’s only 1 (maybe 2) copies of the develop settings left in the database. If you converted to TIFF, you’d clean that up; however, the TIFF would be significantly bigger than the original raw file, and would wipe out any savings in storage that came from the smaller catalog file.

    Moreover, all files in the catalog have at least the default settings stored with them (this obviously is also inefficient in some ways); even though the values would be basically 0, they’d still be there in the catalog for the replacement TIFF.

    If you do go looking to clean up big images, the best solution currently appears to be just clearing the history.

    Also, I always wondered if the brush strokes were recorded as pixels (ie blobs in a database) or as a series of algorithmic numbers representing the path of the brush and its related parameters.

    Brush-like tools are stored algorithmically. The “header” for the entry stores the brush diameter, feather, opacity, and of course the effect it’s applying. Position information is stored as 6-place precision decimal percentages (0–1) that represent the x and y offsets into the image.

Matt O’Brien

Lots of useful stuff. Glad to get clarity on storage of brush strokes, etc..

Perfect size per image is not required. Approximations are good enough. Looking for trends and the large exceptions.

Reviewing the size of XMP files is a good idea.

I only use external xmp files on rare occasions, usually associated with travel.

I did a search of my system. I found approx 40,000 XMP files which consumed 380 MB.

My catalog size is 2.18 GB with 75k images. (It was 4 GB and 100k images, but I have done a big tidy up recently.)

A few images have XMP files close to 1 MB. Approx 50 are between 300 KB and 1 MB. The vast majority are 8–10 KB.

I do my own printing. I will do my general editing in Lightroom, then export to Photoshop, where the files are often in the 300–900 MB range. Any detailed work is done in Ps. I return to Lightroom for some overall tweaks and softproofing before printing. I have stopped using brush tools for detail work in Lr.

I do not save my full catalog to XMP… maybe I will try that and analyse the results. I will use your XMP file size trick to analyse what my top 50/100/500 images are in terms of size. Will let you know if I find anything interesting.
