What's wrong with SQL Server's FILESTREAM?

Arseni Mourzenko

Founder and lead developer

177

articles

November 14, 2014

Tags: database 2 short 50

Microsoft SQL Server contains a feature called FILESTREAM. For those who don't use SQL Server, in short, this feature allows to store very large chunks of data in a database with all the benefits and none of the drawbacks. Under the hood, Microsoft SQL Server stores just the list of files in the MDF (MDF being the file where the data of the database is stored), while the actual files are stored separately. The feature targets large files: files below 1 MB are expected to be stored in a varbinary(max) field.

This feature avoids the usual drawback of storing large chunks of data in a relational database. Usually, if you're storing a few thousand of videos of 4 GB each, you'll quickly come to conclusion that relational databases don't handle it... very well, and also that you need to buy a new server with ten times the RAM you have right now. With FILESTREAM, there is practically no impact of large files on the database. You can store chunks of 100 GB of data each, and it wouldn't matter.

I did it too. Maybe not 100 GB files, but videos from my collection, the larger being 5 GB, with approximately 100 files. I admit, this was a mistake.

The major benefit of using FILESTREAM—of storing files in a database without performance drawback—is the unified management of data. With this feature, I don't need to deal with those files separately: Microsoft SQL Server deals with them as if it were ordinary data. Especially, I don't need to care about a separate backup for those files: a maintenance plan of the database will back up those files as well.

Another benefit could have been the transactional capability, but with .NET's TransactionScope and NTFS transactions, you have the same benefit with ordinary files as well. This means that the single benefit remains the data management.

But here comes the problem: in most cases, we don't need unified management of data. Moreover, we need to separate as much as possible the management of large chunks of data from the management of pieces of information which are usually stored in a relational database.

Let's look at Adobe Lightroom model. Lightroom has two locations where interesting stuff is stored:

The list of photos, the meta (such as the keywords) and the history of every modification (such as contrast) of every photo are all stored in a catalog. The catalog is a database which is managed by Lightroom, shouldn't be accessed by any other application, changes frequently and grows over time. Lightroom does a daily (if configured properly) backup of the catalog.
The originals of the photos, i.e. the RAW files. Those files are often large (often more than 20 MB each), are never changed (but sometimes removed) and are stored as ordinary files. They are intended to be accessed by other applications, such as Adobe Photoshop. Lightroom knows where those photos are stored, but doesn't do any backup of them.

This distinction between constantly changing catalog containing small pieces of information and large data chunks and the way those two are handled by Lightroom leads to an observation that it doesn't make sense to manage those two types of information in the same way.

A catalog should be baked up incrementally, at regular intervals, as often as possible. Older backups can be removed after a few weeks.
Photos should be baked up once, as soon as possible, in order to be stored forever.

Now, what happens if photos are stored within the catalog? Right, you'll end up with an inadequate backup plan.

If photos and catalog are stored separately, you can use Amazon Glacier for the photos and Amazon S3 for the catalog. If photos and catalog are mixed and should be managed the same way, you'll end up paying much more to Amazon, because either your catalog will change too often for Glacier, or your photos will take too much space for S3.

Does it mean that FILESTREAM is useless?

Not at all. It can be used in some very specific applications with great success. It's just that it is extremely overrated, and people are using it in scenarios where they would better be using ordinary files.