Stefan Says

Opinions on ASP.NET, ASP, databases etc
posts - 34, comments - 25, trackbacks - 23

Storing binary data in database - part 2, or why binary data are evil

This post is a follow up of my Should I store images/pdf etc in my SQL Server database?  where I try to argue against storing binary data such as a pdf file in the database. While it's not too hard (this article shows you how to store binary data in SQL Server) - in short my argumentation against it is that data in a database should be able to be used in other ways from within the database (queries etc). I.e not much of a point with just using the database engine as a storage engine.  
  
Here I go one step further and say - binary data (Word documents, PDF files etc) are evil and should
be completely avoided whenever possible. Yes, they are a great way of showing data, they are a great way of distributing data but they should NOT be used for data storage.

The ultimate solution - in the perfect world:

  • raw data (the actual numbers and words etc) are stored in strongly typed database tables.
  • when you need a PDF it should be dynamically generated from the data
  • when you need a Word document it should be dynamically generated from the data

Data is amongst the most valuable thing a company has (a lot more valueable than the actual systems) and therefore you should NEVER tie them to a specific format. Cause while applications will come and go - your data will (need to) last. Well, in the case of PDF I agree - it will be around for a LONG time, so will Word also, but there are a lot of nonstandard questionable data formats around and also there is another aspect:

While having the meeting secretary type the meeting protocol directly in Word might be convenient - there might be a day when someone ask you (the data nerd) to construct a list of all people who have attended the meetings. 

So you create a script, searching all *.doc in the directories where the protocols are stored, for each document you need to parse the files (using Word automation) looking for a pattern which will identify the section where the attendents are written. You finally get it going but suddenly the script breaks - turns out that they made a template change and the format has changed...

While the specific example is pretty fabricated, the need for these type of solutions are not.

Moreover, as data is your most valuable resource - don't buy a system where you don't have access to the actual data - and by access I not only mean physical access to the tables etc - but you must be able to understand the structure as well. It is your data - you should have the right to use it in other queries, aggregations etc outside of the system you bought. That's my opinion.

Remember it is also a fact that software companies comes and goes - but as I said - your data will probably need to last.

As I said, this is pretty much about how it could look like in the perfect world - there are indeed other factors playing a role in this. I have a client which sometimes requests some valuation analyzing reports from a (very respected) company  - and those are indeed delivered by PDF - end of story. Meaning you can't control everything.

It's a long way to go, but the direction described here must be the right way, agree?
 

  

Print | posted on Wednesday, September 06, 2006 9:45 AM