Thursday, May 11, 2006
To BLOB Or Not To BLOB
A topical question is whether BLOBs should be served from a database. My colleague, Jim Starkey, invented the term BLOB and, understandably, is enthusiastic about the concept. However, others are concerned about the practical overhead. So, after some discussion, the answer is maybe, depending on your application. Here's a guide for whether hold your BLOBs in or out of a database:
- If your BLOBs are *huge* (hundreds of megabytes) then use a filing system.
- If the size of your BLOBs are tiny (sector size or smaller) then use a database.
- If you need scale-out then you've got a borderline case. Keep reading.
- If you've got a huge number of BLOBs then you've got another borderline case.
- If you want concurrency then you've got another borderline case.
- If you want relational structure (categories, tags, RSS) then use a database.
- If you want historical copies then use a database.
- If you'll only access by a hierarchical namespace or a trivial flat namespace then use a filing system.
Rule 1: Huge BLOBs
This is a matter of practicality. If you're serving disk images from a database then lumping multiple disk images into one database creates problems for backup which may impinge on unrelated applications. Obviously, the absolute BLOB size threshold will rise with increasing media capacity. However, the practical aspect remains. Don't put multiple disk images in a database table.
Rule 2: Tiny BLOBs
This a forte of databases. Filing systems are generally inefficient at storing small BLOBs whereas databases are very efficient at packing rows. Ergo, pack your BLOB in a row and use your database as an smarter, improved media manager. Plus, you get nifty features for free, such as indexing and replication. Even if the *average* size is smaller than one or two sectors then this makes perfect sense.
Admittedly, some filing systems, such as ReiserFS are very good at handling small BLOBs by creating inodes dynamically and allowing sectors to be shared. Unfortunately, this type of filing system cannot be guaranteed. However, you can insist that your application interfaces to specific implementations of SQL database. So, that's the current practice.
Rule 3: Scale-Out
This is more contentious. Some insist that BLOBs be served from a database for maximum scale-out. Others say that the increased database load creates unecessary pressure to scale-out. Some say that it is best to handle this data as static content and serve it from the presentation layer. Then others question the static nature of those BLOBs. It also precludes access rights.
I've seen the details of this argument while involved in yet another J2EE PetStore development. For those blissfully ignorant of PetStore, it is a reference 100% Pure Java, J2EE, ECommerce application which has been used as the basis of at least 200 ECommerce applications. Most are glorified shopping baskets, like the original app. All are incompatible. However, some have innovative features, such as very tight integration with couriers.
Anyhow, the management interface of the reference implementation allows product images to be uploaded. It places them in the uploaded images in the filing system. Do you see the problem yet? This is the filing system of the application layer and not the presentation layer. Therefore, it works on one server but it will fail in a scale-out environment, such as a WebLogic cluster. My suggested solution to have a static content server mounted by all application servers was rejected on many counts. Firstly, it was deemed unduly onerous. Secondly, it was deemed to be a single point of failure. Thirdly, it was deemed to be a security problem. Fourthly, it was deemed not to be in the spirit of 100% Pure Java. Fifthly, there were eroneous objections to mounts and paths being platform specific, despite the application already specifying such paths.
The scale-out solution is to write an EJB [Enterprise Java Bean] which serves BLOBs. This cunningly sidesteps special configuration and associated security problems. Unfortunately, this ignores code bloat and associated problems with code quality. If you're lucky, you'll get an EJB to serve BLOBs and an EJB to manage BLOBs. These will use their descriptors to locate a database facade via JNDI. If you're unlucky, you'll get a EJB to serve each type of BLOB and management EJBs with messaging and an XML facades using their own DTDs. Furthermore, each EJB creates its own database connection and the descriptors only work in one enterprise environment. I saw the latter.
The underlaying code wasn't brilliant either. This was to use the first version of Oracle's 100% Pure Java JDBC driver. Unfortunately, BLOBs were unimplemented (doh!) and CLOBs were capped at 4KB. So, it was a case of choose any two: Oracle9i, BLOBs in the database for scale-out and/or 100% Pure Java. Well, the latter was dropped on the basis that an improved connector should become available.
Matters are improved nowadays. However, if you want BLOBs in a database for scale-out then it won't solve your problems. It will only give you a different set of problems.
Rule 4: Huge Number Of BLOBs
If you've got a huge number of BLOBs then you'll either require too many inodes from your filing system or you'll have huge database tables. Both are problematic. Trivial cases are obvious candidates for using a filing system. If you're serving multimedia on the web then handle as static content and use DNS hacks to scale-out. Rich cases should consider properties in more detail.
Rule 5: Concurrency
If multiple users will be manipulating BLOBs then database involvement is obvious. However, this doesn't preclude references to stuff in a filing system. If updates come from multiple sources then the most suitable handling of BLOBs remains undecided.
Rule 6: Relational Structure
If you need to do relational stuff then use a database. If you want to tag items in multiple categories or access by criteria, such as dates, then use a database. A typical example is blogging. If you want to implement an RSS feed which sends the titles of the most recent articles then index your datestamps. This makes selecting the most articles into a trivial database operation.
If you want access rights and logging then this could be provided by either a database or specific filing systems. However, using a database requires access via an application. This allows you to implement your own rights and logging with the database implementation as a fallback. A filing system may or may not provide these facilities.
All of this assumes that your BLOBs aren't huge (Rule 1). In which case, use references. A 64KB limit isn't huge.
Rule 7: Historical Copies
Hold historical data inside a database. This eliminates filing system clutter. Again, this assumes that your BLOBs aren't huge (Rule 1).
However, don't index a status column if the majority of fields will eventually have the same value. The cardinality will initially be varied and therefore makes an ideal index for retrieval. However, as time progresses, the cardinality will fall and the index will eventually be ignored.
Rule 8: Hierarchical Access
If you've got this far then you don't have awkward BLOB sizes or quantities, or scale-out concerns. You don't require access rights, relational structure or historical copies. So use your filing system!
Any tip of the day? Any one have a nice link to a nice site about blobs...
In your case, you have logs and you know the format of your logs. Furthermore, you are handling a reasonable volume of data. In this case, don't store BLOBs in or out of your database. Store each log entry as a row in the database. Use appropriate datatypes. Normalize, if possible.