BEGIN TRANSACTION …

(by: Jasmin Azemović)


Varbinary vs. Filestream and other BLOB issues



It is well known that one result of the Internet’s rapid growth has been a huge increase in the amount of information generated and shared by organizations in almost every industry and sector. The problem is not only generating this data; storing and accessing it raises issues of its own. Less well known, however, is the degree to which this information explosion has consumed huge amounts of expensive and valuable resources, both human and technical. These demands, in turn, have created an equally huge, but largely unmet, need for tools that can manage what we call unstructured data. Admittedly, the term unstructured data can mean different things in different contexts. In the context of relational database systems, for example, it refers to data that can’t be stored in rows and columns. This data must instead be stored in a BLOB (binary large object), a catch-all data type available in most relational database management system (RDBMS) software. Here, unstructured data is understood to include e-mail files, word-processing text documents, presentations, image files, and video files.

According to a study by IDC, a leading Information Technology market research and analysis firm, the amount of data captured, stored, and replicated worldwide was projected to grow from 161 exabytes in 2006 to over 988 exabytes in 2010 (1 exabyte = 10^18 bytes). Over 95% of the digital universe is unstructured data.

A majority of this data would be in the form of images, captured from a large number of devices, such as digital cameras, camera phones, surveillance cameras, and medical imaging equipment. Most of this data would need to be stored and managed in centralized systems within organizations. The study indicates that, by 2010, although enterprises will create, capture, and replicate only 30% of the digital universe, they will have to store and manage over 85% of all data in it.

According to this study, 80% of all data stored by organizations is unstructured. This growth trend is expected to continue into the future, thereby mandating the need for efficient ways of storing, searching, structuring, and securing unstructured data.

SQL Server positions itself here as a DBMS environment capable of handling all kinds of data, structured and unstructured. There are two major data types for storing BLOBs (binary large objects) inside the database structure (this article will not go into the technical details of the data types themselves):

  • Binary: fixed-length binary data with a length of n bytes, where n is a value from 1 through 8,000. The storage size is n bytes.

  • Varbinary: variable-length binary data. n can be a value from 1 through 8,000; max indicates that the maximum storage size is 2^31-1 bytes.
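As a minimal sketch of the in-database approach, the following T-SQL creates a table with a varbinary(max) column and loads a file into it with OPENROWSET. Table, column, and file names are illustrative assumptions, not part of any real system:

```sql
-- Hypothetical table for storing documents as in-row/in-table BLOBs
CREATE TABLE dbo.Documents
(
    DocumentID INT IDENTITY(1,1) PRIMARY KEY,
    FileName   NVARCHAR(260)  NOT NULL,
    Content    VARBINARY(MAX) NOT NULL
);

-- Load a file from disk into the BLOB column (example path)
INSERT INTO dbo.Documents (FileName, Content)
SELECT N'report.docx', BulkColumn
FROM OPENROWSET(BULK N'C:\Files\report.docx', SINGLE_BLOB) AS doc;
```

Retrieval is then an ordinary query against dbo.Documents, which is exactly the "easy search and retrieval" benefit discussed below.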

Before SQL Server 2008 we had two options for storing BLOBs in our software environments:

  1. Inside relational database environment
  2. Outside relational environment

A point of contention usually arises with the topic of relational databases and binary large object (BLOB) data: Is it better to integrate BLOBs within the database or store them in the file system? Each method has its advantages and disadvantages.

 Inside relational database environment

Storing unstructured data such as images, audio files, and executable files in the database with typical text and numeric data lets you keep all related information for a given database entity together. And this approach enables easy search and retrieval of the BLOB data; you simply query its related text information. However, storing unstructured data can dramatically increase the size of your databases.

 

Main advantages of using this method:

  • When the data is stored in the database, the backup is consistent. There is no need for a separate backup policy that could introduce inconsistency;
  • When the data is stored inside the database, it is part of the transaction. A rollback, for example, covers the binary data along with all traditional database operations. This usually makes the client solution more robust with less code.

 

Main disadvantages of using this method

  • storing unstructured data can dramatically increase the size of databases;
  • backup and restore can take a long time;
  • performance issues with I/O subsystems

Outside relational database environment

The common alternative to this technique is storing binary files outside the database and keeping only a file path or URL to the object as data in the database.
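A minimal sketch of this pattern: the database keeps only metadata and a link, while the bytes live on a file share. All names and the UNC path are assumptions for illustration:

```sql
-- Only metadata and a path/URL live in the database; the file itself
-- stays on the file system. The application must keep the link in sync.
CREATE TABLE dbo.DocumentLinks
(
    DocumentID INT IDENTITY(1,1) PRIMARY KEY,
    FileName   NVARCHAR(260)  NOT NULL,
    FilePath   NVARCHAR(1000) NOT NULL  -- e.g. N'\\fileserver\blobs\report.docx'
);
```

Note that nothing in the database engine enforces that FilePath actually points at an existing file, which is precisely the synchronization problem described below.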

This separate storage method has a couple of advantages over integrating BLOB data within the database. It’s somewhat faster because reading data from the file system involves a bit less overhead than reading data from a database. And without the BLOBs, databases tend to be smaller.

However, we need to manually create and maintain a link between the database and external file system files, which have the potential to get out of sync. In addition, we usually need a unique naming or storage scheme for the OS files to clearly identify the potentially hundreds or even thousands of BLOB files.

Storing BLOB data within the database eliminates these problems by letting you store BLOB data along with its related relational data.

Main disadvantages of using this method:

  • When the data is stored outside of the database, the backup is not consistent;
  • Unstructured data is not part of the transaction

Main advantages of using this method

  • storing unstructured data outside the database can decrease database size and reduce I/O load;
  • backup and restore take less time;

So far the choice is black or white: whatever we choose, there are some “bad” factors to handle. Fortunately the world is not black or white; SQL Server 2008 and above shed some more light and offer cool options for storing BLOBs. I like to call it the hybrid way.

Hybrid way for storing BLOBs

A problem with the first two methods is that we don’t know how the actual size of the data affects database performance. The results so far don’t tell us whether there is any difference when storing BLOBs of 10 KB, 1 MB, 100 MB, etc.

In the hybrid approach the data lives “outside” the database environment, but the major advantage is that the BLOBs remain under the database’s transactional consistency.

The SQL Server database engine supports a “new” data type: filestream. Filestream support combines the benefit of accessing BLOBs directly from the NTFS file system with the referential integrity and ease of access offered by the traditional relational database engine. In SQL Server, BLOBs can be standard varbinary(max) data that stores the data in tables, or filestream varbinary(max) objects that store the data in the file system. The measurements below will show that the size and use of the data determine whether you should use database storage or file system storage. Filestream storage is implemented as a varbinary(max) column in which the data is stored as BLOBs in the file system. The sizes of the BLOBs are limited only by the volume size of the file system; the standard varbinary(max) limitation of 2 GB does not apply to BLOBs stored in the file system.
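A sketch of setting filestream up, assuming a hypothetical BlobDemo database laid out on the C:, D: and E: drives used in the lab environment below (all names and paths are illustrative; filestream must also be enabled in SQL Server Configuration Manager):

```sql
-- Enable FILESTREAM Transact-SQL and Win32 streaming access
EXEC sp_configure 'filestream access level', 2;
RECONFIGURE;

-- Database with a dedicated FILESTREAM filegroup on drive E:
CREATE DATABASE BlobDemo
ON PRIMARY (NAME = BlobDemo_data, FILENAME = 'C:\Data\BlobDemo.mdf'),
FILEGROUP FsGroup CONTAINS FILESTREAM
    (NAME = BlobDemo_fs, FILENAME = 'E:\FilestreamData')
LOG ON (NAME = BlobDemo_log, FILENAME = 'D:\Log\BlobDemo.ldf');
GO

USE BlobDemo;
GO

-- A FILESTREAM table requires a ROWGUIDCOL column with a unique constraint
CREATE TABLE dbo.FsDocuments
(
    DocumentID UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
    FileName   NVARCHAR(260) NOT NULL,
    Content    VARBINARY(MAX) FILESTREAM NULL
);
```

From the client’s point of view, Content is queried like any other varbinary(max) column, while the engine stores the bytes as files under E:\FilestreamData and keeps them inside the transaction.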

A little bit of retrospection: filestream is actually one of the legacies of a really cool, advanced and ahead-of-its-time project. WinFS (Windows Future Storage) was an attempt to build relational database storage on top of the NTFS file system. The project was canceled; I hope it will be continued at some point in the future. More info can be found here: http://en.wikipedia.org/wiki/WinFS and here: http://blogs.msdn.com/b/winfs/

OK, let’s get back to the topic. Now we will do some math and make a comparative analysis of using classic varbinary to store BLOBs inside SQL Server tables versus storing the same content using filestream.

Lab environment

For this purpose I created a test environment with the following components:

  • Processor: AMD X2 3.0 GHz
  • 4 GB physical memory
  • Database files on drive C:
  • Log files on drive D:
  • Filestream storage on drive E:
  • Drives C, D and E are separate physical SATA disk drives
  • SQL Server 2008 R2 and custom testing client application are on the same machine

The following charts show average upload times for:

  • 100 KB file repeated 3 times for each measurement
  • 1 MB file repeated 3 times for each measurement
  • 10 MB file repeated 3 times for each measurement

In this measurement you can clearly see the overhead and performance impact of using filestream for “small” files. The time to store inside the database was less than 10 ms every time, while the time to store using filestream was 4-5 times longer.

With 1 MB of data, traditional varbinary and filestream behave quite similarly, and the difference between them narrows considerably.

When 10 MB files are used, storing the data inside a traditional database file is much slower. Based on these measurements, it is more efficient to use filestream when the typical file size is about 1 MB or more; if files are clearly under 1 MB, traditional storage performs better. When measuring the times, we also found that deleting filestream-based rows is much faster than deleting rows stored inside the table.

This new way of putting the file system under database consistency was a good opportunity to test the technology across different areas of data storage. The test sizes of 10 KB, 1 MB and 10 MB correspond, in real information systems, to things like user photos, CV files, small office documents, and video and audio files. This analysis can be used to model a system around its content and clearly identify its data storage needs, getting maximum performance out of the hardware and software infrastructure. For systems where performance issues already exist, this model can help identify bottlenecks and find ways to improve them. In the end the final decision is up to you; fortunately we have good choices to pick from.

END TRANSACTION

Written by Jasmin Azemović

15/05/2012 at 22:06