BEGIN TRANSACTION …

(by: Jasmin Azemović)

Varbinary vs. Filestream and other BLOB issues

It is well known that one result of the Internet's rapid growth has been a huge increase in the amount of information generated and shared by organizations in almost every industry and sector. The problem is not only generating that data, but also storing and accessing it. Less well known, however, is the degree to which this information explosion has consumed huge amounts of expensive and valuable resources, both human and technical. These demands, in turn, have created an equally huge, but largely unmet, need for tools that can manage what we call unstructured data. Admittedly, the term unstructured data can mean different things in different contexts. In the context of relational database systems, for example, it refers to data that cannot be stored in rows and columns. Such data must instead be stored in a BLOB (binary large object), a catch-all data type available in most relational database management system (RDBMS) software. Here, unstructured data is understood to include e-mail files, word-processing documents, presentations, image files, and video files.

According to a study by IDC, a leading Information Technology market research and analysis firm, the amount of data captured, stored, and replicated worldwide was projected to grow from 161 exabytes in 2006 to over 988 exabytes in 2010 (1 exabyte = 10^18 bytes). Over 95% of the digital universe is unstructured data.

A majority of this data would be in the form of images, captured from a large number of devices, such as digital cameras, camera phones, surveillance cameras, and medical imaging equipment. Most of this data would need to be stored and managed in centralized systems within organizations. The study indicates that, by 2010, although enterprises will create, capture, and replicate only 30% of the digital universe, they will have to store and manage over 85% of all data in it.

According to this study, 80% of all data stored by organizations is unstructured. This growth trend is expected to continue, thereby mandating the need for efficient ways of storing, searching, structuring, and securing unstructured data.

SQL Server steps in here as one of the DBMS environments capable of handling all kinds of data, structured and unstructured. There are two major data types for storing BLOBs (binary large objects) inside the database structure (this article will not go into the technical details of the data types themselves):

Binary (fixed-length binary data with a length of n bytes, where n is a value from 1 through 8,000; the storage size is n bytes).

Varbinary (variable-length binary data; n can be a value from 1 through 8,000, and max indicates that the maximum storage size is 2^31-1 bytes).
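As a minimal sketch of the in-database approach, a table with a varbinary(max) column can hold the BLOB next to its descriptive data (the table, column, and file names below are illustrative, not from the original article):

```sql
-- Illustrative sketch: a BLOB stored inside the database, next to its metadata.
CREATE TABLE dbo.Documents
(
    DocumentID int IDENTITY(1,1) PRIMARY KEY,
    FileName   nvarchar(260) NOT NULL,
    Content    varbinary(max) NOT NULL   -- the BLOB itself, up to 2 GB
);

-- Load a file from disk into the table (path is an example only)
INSERT INTO dbo.Documents (FileName, Content)
SELECT N'logo.png', BulkColumn
FROM OPENROWSET(BULK N'C:\temp\logo.png', SINGLE_BLOB) AS b;
```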

Before SQL Server 2008 we had two options for storing BLOBs in our software environments:

  1. Inside relational database environment
  2. Outside relational environment

A point of contention usually arises with the topic of relational databases and binary large object (BLOB) data: Is it better to integrate BLOBs within the database or store them in the file system? Each method has its advantages and disadvantages.

 Inside relational database environment

Storing unstructured data such as images, audio files, and executable files in the database with typical text and numeric data lets you keep all related information for a given database entity together. And this approach enables easy search and retrieval of the BLOB data; you simply query its related text information. However, storing unstructured data can dramatically increase the size of your databases.

 

Main advantages of using this method:

  • When the data is stored in the database, the backup is consistent; there is no need for a separate backup policy that could introduce inconsistency;
  • When the data is stored inside the database, it is part of the transaction. A rollback, for example, includes all traditional database operations along with the binary data operations. This usually makes the client solution more robust with less code.

 

Main disadvantages of using this method:

  • Storing unstructured data can dramatically increase the size of databases;
  • Backup and restore can take a long time;
  • Performance issues with the I/O subsystem.

Outside relational database environment

The common alternative to this technique is storing binary files outside the database and including in the database only a file path or URL to the object.
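A hedged sketch of this pattern (the table and the UNC path are illustrative only): the database row carries nothing but a pointer, and the application is responsible for the file itself.

```sql
-- Illustrative sketch of the "outside the database" pattern: only a path is stored.
CREATE TABLE dbo.DocumentLinks
(
    DocumentID int IDENTITY(1,1) PRIMARY KEY,
    FilePath   nvarchar(260) NOT NULL   -- e.g. N'\\fileserver\docs\contract.pdf'
);
-- The application reads and writes the file itself; the database knows only the
-- path, so the link can silently break if the file is moved or deleted.
```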

This separate storage method has a couple of advantages over integrating BLOB data within the database. It’s somewhat faster because reading data from the file system involves a bit less overhead than reading data from a database. And without the BLOBs, databases tend to be smaller.

However, we need to manually create and maintain a link between the database and external file system files, which have the potential to get out of sync. In addition, we usually need a unique naming or storage scheme for the OS files to clearly identify the potentially hundreds or even thousands of BLOB files.

Storing BLOB data within the database eliminates these problems by letting you store BLOB data along with its related relational data.

Main disadvantages of using this method:

  • When the data is stored outside of the database, the backup is not consistent;
  • Unstructured data is not part of the transaction

Main advantages of using this method:

  • Storing unstructured data outside the database can decrease the database size and reduce the I/O load;
  • Backup and restore take less time.

So far everything is black or white: whatever we choose, there are some "bad" factors to handle. Fortunately, the world is not black and white; SQL Server 2008 and above in particular shed more light and offer cool options for storing BLOBs. I like to call it the hybrid way.

Hybrid way for storing BLOBs

The problem with the first two methods is that we don't know how the actual size of the data affects database performance. The results so far don't tell us whether there is any difference when storing BLOBs of 10 KB, 1 MB, 100 MB, etc.

In this case the data is "outside" of the database environment, but the major advantage is that the BLOBs remain under the database's transactional consistency.

The SQL Server database engine supports a "new" data type: filestream. Filestream support combines the benefit of accessing BLOBs directly from the NTFS file system with the referential integrity and ease of access offered by the traditional relational database engine. In SQL Server, BLOBs can be standard varbinary(max) data that stores the data in tables, or filestream varbinary(max) objects that store the data in the file system. This research will show that the size and use of the data determine whether you should use database storage or file system storage. Filestream storage is implemented as a varbinary(max) column in which the data is stored as BLOBs in the file system. The sizes of the BLOBs are limited only by the volume size of the file system. The standard varbinary(max) limitation of 2-GB file sizes does not apply to BLOBs that are stored in the file system.
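A hedged sketch of what a filestream-enabled database and table can look like (database name, file paths, and table names are illustrative; FILESTREAM must also be enabled at the instance level, e.g. via sp_configure, before this runs):

```sql
-- Illustrative sketch: a database with a FILESTREAM filegroup and a table
-- whose varbinary(max) column lives in the file system.
CREATE DATABASE BlobDemo
ON PRIMARY (NAME = BlobDemo_data, FILENAME = 'C:\Data\BlobDemo.mdf'),
FILEGROUP FSGroup CONTAINS FILESTREAM
    (NAME = BlobDemo_fs, FILENAME = 'E:\FilestreamStore')
LOG ON (NAME = BlobDemo_log, FILENAME = 'D:\Log\BlobDemo.ldf');
GO

USE BlobDemo;
GO
CREATE TABLE dbo.FileStore
(
    -- FILESTREAM tables require a ROWGUIDCOL column with a unique constraint
    FileID  uniqueidentifier ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
    Content varbinary(max) FILESTREAM NULL   -- stored as a file on drive E:
);
```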

A little bit of retrospection: filestream is actually one of the legacies of a really cool, advanced and ahead-of-its-time project. WinFS (Windows Future Storage) was an attempt to build relational database storage on top of the NTFS file system. The project was canceled; I hope it will be continued at some point in the future. More info can be found here: http://en.wikipedia.org/wiki/WinFS and here: http://blogs.msdn.com/b/winfs/

OK, let's get back to the topic. Now we will do some math and make a comparative analysis of using classic varbinary to store BLOBs inside SQL Server tables, versus storing the same data using filestream.

Lab environment

For this purpose I created a test environment with the following components:

  • Processor: AMD X2 3.0 Ghz
  • 4 GB physical memory
  • Database files on C:
  • Log files on drive D:
  • Filestream storage on drive E:
  • Drives C, D and E are separate physical SATA disk drives
  • SQL Server 2008 R2 and custom testing client application are on the same machine

The following charts show average upload times for:

  • 100 KB file repeated 3 times for each measurement
  • 1 MB file repeated 3 times for each measurement
  • 10 MB file repeated 3 times for each measurement

In this measurement, you can clearly see the overhead and the performance impact of using filestream on "small" files. The time to store a file inside the database, in milliseconds, was below 10 ms every time. On the other hand, storing it using filestream took 4-5 times longer.

With 1 MB of data, traditional varbinary and filestream behave quite similarly, and the difference between them is small.

When 10 MB files are used, storing the data inside a traditional database file is much slower. Based on these measurements, it is more efficient to use filestream when the typical file size is about 1 MB or more. If the files are small (clearly under 1 MB), traditional storage performs better. When measuring the times, we found that deleting filestream-based rows is much faster than deleting rows stored inside the table.

This new way of using the file system under database consistency was a good opportunity to test the technology in different areas of data storage. The test examples of 10 KB, 1 MB, and 10 MB correspond, in real information systems, to user photos, CV files, small office documents, and video and audio files. Based on the content, this analysis can be used to model a system and clearly identify its data storage needs; the benefit is getting maximum performance from the hardware and software infrastructure. In systems where performance issues already exist, this model can help identify bottlenecks and find ways to improve them. In the end, the final decision is up to you; fortunately, we have good choices to pick from.

END TRANSACTION

Written by Jasmin Azemović

15/05/2012 at 22:06

Demistificirani SQL (2)

We continue the series of posts with exam problems that belong in the Hall of Fame :). The following example is a classic in that category, where we play with string functions that can be very useful. The query below exists in several versions (over different databases), but the point is the same every time. The problem goes something like this:

Your company has decided to assign new e-mail addresses to its employees. Existing data from the corporate database will be used for this purpose. The output should consist of three new columns: Email, Password, and Age. The following conditions are set before you:

  • Email is formed from the data in the LastName, FirstName, and City columns, in the following format: LastName.Firstname@City.Com (all lowercase);
  • Password is formed from the data in the Notes, Title, and Address columns as follows: concatenate the columns (Notes, Title, Address), then reverse the resulting string. From the reversed string, skip the first 17 characters and take the next 8. Any – (dash) characters in the result must be replaced with the @ character;
  • Age is calculated from the BirthDate column and the current date.

The condition is that the e-mail, password, and age are generated only for employees that have an address entered.

Database: Northwind. (9 records)

The solution:

USE Northwind
SELECT LOWER (LastName + '.' + FirstName + '@' + City + '.com') AS Email,
REPLACE (SUBSTRING ( REVERSE ((CONVERT (nvarchar,Notes)+Title+Address)),17,8), '-', '@') AS Lozinka,
DATEDIFF (year, BirthDate, GETDATE()) AS Godine
FROM dbo.Employees
WHERE Address IS NOT NULL
ORDER BY Godine DESC

That's all for this time.

END TRANSACTION

Written by Jasmin Azemović

15/09/2011 at 16:03

Posted in Uncategorized

Demistificirani SQL (1)

I have decided to occasionally post solutions to exam problems for students of the course "Upravljanje bazama podataka" (Database Management), from the lab and workshop part where we mostly work with SQL commands (DML and DDL). I will pick the queries that were "harder", i.e. the ones students had difficulties implementing. Of course, if I feel it is needed, this series of posts may become regular... but that is not up to me, it is up to the audience.

The goal of these posts is not for students to memorize queries and try to pass the exam that way (SQL coding is not something you can cram), but quite the opposite: to encourage them to sit down on their own over the sample databases and pose problems (i.e. queries) to themselves. Programming of any kind, SQL included, is nothing more than practice, certainly 90% of it; talent is 10%. However, practice can make up for a "lack" of talent.

All queries are done in the SQL Server environment (SSMS) against the sample databases prepared for this database platform.

What we need:

  1. SQL Server 2008 R2, 2008 or 2005 (Denali is also an option)
  2. Sample databases (depending on your SQL Server version, download the appropriate ones: http://msftdbprodsamples.codeplex.com/)
  3. Sometimes I use the "old" sample databases Northwind and Pubs. They can be downloaded at: http://www.microsoft.com/download/en/details.aspx?id=23654
  4. If you have never installed SQL Server before, my video tutorial explains it: http://www.youtube.com/watch?v=ahG4EF8G-3c
  5. If you have problems with the sample databases, look here: http://www.youtube.com/watch?v=xG258yGjLG0

To start, I picked the first query from the exam held on 06.09.2010. The problem text goes like this:

Database: AdventureWorksLT

For each product category, show its products with the following attributes: category name, product name, product price, color, sell start date, sell end date, and product size. Besides the required attribute list, the query must satisfy the following criteria:

  • A new column must be created that, based on the sell start and end dates, shows the total number of days the product has been or was on sale;
    The conditions or filters for the query are:
    • The product name starts with HL, or the number of sale days is greater than or equal to 364 and the product's sale has ended
  • The output must not contain:
    • The M mark in the product name
    • A price lower than 1000

The solution:

USE AdventureWorksLT
SELECT PC.Name, P.Name, P.ListPrice, P.Color, P.SellStartDate, P.SellEndDate,
DATEDIFF (DAY, P.SellStartDate, SellEndDate) AS 'Period prodaje',
P.Size

FROM SalesLT.ProductCategory AS PC
INNER JOIN SalesLT.Product AS P
ON PC.ProductCategoryID = P.ProductCategoryID

WHERE  (P.Name LIKE 'HL%' OR DATEDIFF (DAY, P.SellStartDate, SellEndDate) >= 364)
AND SellEndDate IS NOT NULL
AND (P.Name NOT LIKE '%M' AND P.ListPrice > 1000 )

ORDER BY P.Size DESC

That's all for this time.

END TRANSACTION

Written by Jasmin Azemović

08/09/2011 at 10:55

Posted in Database, FIT, SQL, TSQL

Training summary

As part of MSCommunity activities, I held a short training session on the topic "Analysis, design, and relational modeling" at the Microsoft BiH offices (Unitic, Sarajevo, 11.08.2011). In my opinion the training went very well, and most importantly the group was very communicative and clearly eager to learn and hear something new.

The training started at 10.00 and lasted until 15.00. Five hours is not much for everything I wanted to say; if you consider that I teach this subject matter for a whole semester, the picture becomes clearer. For that reason the training was divided into five lessons:

1. Basic concepts, the importance of databases, and an introduction to SQL Server

Every story has its beginning. Ours started by defining the basic concepts and putting the database in its place, by which I mean assessing its importance and the context it lives in. Looking at things from today's perspective, it is clear that there is practically no information system that is not connected to some data storage. The lesson ends with an introduction to SQL Server, which the whole story revolves around.

2. Database foundations

The second lesson shifts the focus to the very foundations a good database rests on (emphasis on good). Among other things, these are: the relational model and normalization, availability, redundancy, scalability, storage, security, auditing, and privacy. We took a slightly more detailed look at the "childhood diseases" that are reaching pandemic status: little or no care for the privacy and security of the data in databases. The primary content of the lesson is the relational model, the PK/FK concepts, and the normalization process.

3. Analysis and design phases

An equally important part of the course is a walk through all the phases of analyzing and designing an information system, with the goal of making the right development decisions. Mistakes in these phases can otherwise cause chain reactions that later stretch through the entire software life cycle.

4. The "Studentska služba" project

The part of the training where participants saw, on a practical example system, the application of the first three phases of analysis and design (i.e. the ones that involve databases the most). About 30% of the system's functionality was modeled; still, the prototype can be a good basis for further modeling.

5. Practical part

Each participant picked one of the offered topics and, in about 45 minutes, tried to perform the analysis and a basic design of the database structure. The results were very good!

Training resources

Presentation (PDF)

eSluzbaDB (database model) - note that, due to lack of time, not much time was "wasted" on data type choices and ranges in the model, but a correct choice of types and ranges is certainly one of the links in a good design.

Video content (link to the YouTube channel; individual links are below)

The whole lecture lasts about 4 hours and is split into three parts. Note that everything was recorded live, so there is interaction with the audience and all the usual problems of a live session :). The microphone was fixed in place, so the volume may oscillate at certain moments as I move through the room. All in all, I believe these three videos convey an excellent working atmosphere :)

Part one of the video content.

Part two of the video content.

Part three of the video content.

This training would reach its full strength if it were extended to 2-3 days. There would then be much more time to include elements that are missing, and certainly to expand the existing ones.

END TRANSACTION

Written by Jasmin Azemović

18/08/2011 at 12:14

Random data and SQL Server (re-write)

A few weeks ago I needed to extract a certain number of records from a database, picked at random. Various problems appeared: the SQL language has no command that would "grab" the TOP n records, shuffle them a bit, and output a completely RANDOM set.

For example, if I want to take the first 5 records from the AdventureWorks database (table Person.Contact), I type a command that looks like this:

SELECT TOP 5 ContactID, FirstName
FROM Person.Contact

The result of the execution will be ordered by the PK, in this case ContactID (because I did not define otherwise with an ORDER BY clause).

ContactID   FirstName
---------   ---------
1           Gustavo
2           Catherine
3           Kim
4           Humberto
5           Pilar

(5 row(s) affected)

However much we try to adjust the query above with ORDER BY and change the sort order... we do not have many options if we want a RANDOM list.

I also tried TABLESAMPLE.

SELECT ContactID, LastName
FROM Person.Contact
TABLESAMPLE (10 PERCENT)

However, it is not really random. BOL (Books Online) also says this is not the best solution if we want a random selection of data. What now?

The solution is generally very simple, although it has its pros and cons. Surely we all know the NEWID() function. It generates a uniqueidentifier value that is definitely unique: the well-known 16-byte GUID. If NEWID() generates a different GUID every time, why couldn't we apply it in the ORDER BY clause? Let's go back to the first query:

SELECT TOP 5 ContactID, FirstName
FROM Person.Contact
ORDER BY NEWID()

After execution, the sort order will be different than in the first listing. Press F5 again, and again, and again... each execution will return a different set of 5 records. This definitely solved my problem. It may be more illustrative if you change TOP 5 to TOP 1: now every F5 returns a different single record, and the difference is easier to notice.

What are the side effects of this method? There is one drawback (as far as I could notice; I did not dig deep into the problem). On very large tables (a few hundred thousand or several million records), NEWID() has to be generated for every one of those records before the ORDER BY can be done. That procedure can take a VERY long time. The Person.Contact table has a little over 19,000 records, so the performance problem is not really visible, but if we look at the execution plan of our query, where NEWID() is generated and then sorted on (it is not indexed), things become a little clearer.

In any case, this solved my problem. If any of you have similar requirements, NEWID() is definitely something you should consider.

END TRANSACTION

Written by Jasmin Azemović

08/08/2011 at 22:28

Posted in Database, SQL, SQL Server, TSQL

SQL Server Best Of (OFF)

I finally got the Channel 9 link where my talk, held in Sarajevo on 26.05.2011 as part of TechDays 2011, was published. For those who prefer YouTube, the talk can also be found on my channel and lasts 50 minutes. The topic was a selection of the best (in my opinion) features of the SQL Server environment that are underused in this region. Naturally, I am referring to Bosnia and Herzegovina, although I believe the situation is very similar in the neighboring countries.

Summary in English:

SQL Server contains a lot of qualities that are perhaps not quite noticeable at first glance. Some things are resolved much better than in competing RDBMSs. The aim of this presentation is to show what is best in SQL Server.

Presentation (pptx)

PS: The OFF is not a mistake; the idea is to interpret Of as OFF, i.e. switched-off or neglected functionality :)

END TRANSACTION

Written by Jasmin Azemović

05/08/2011 at 22:16

Posted in Database, MVP, SQL Server

sp_MSforeachtable (re-write)

Let me try to summarize a problem I had a few weeks ago. A need arose to add one new attribute of type uniqueidentifier to every table in an "old" database. Nothing special: ALTER TABLE, or right-click a table and choose "Design". No problem at all... except one. The database has over 150 tables!!

I could not accept that there was no easier, hidden way to cascade a change across objects regardless of their number. It is well known that Microsoft, especially in SQL Server, hides a certain number of undocumented procedures. These are hidden blocks of T-SQL code that usually do fantastic and cool things, but the problem is that all of it is "hidden" from the eyes of the global database population. By definition, an undocumented procedure has no support; in other words, there is no one to ask. The motto is: manage on your own.

The solution to my problem was found (after a few hours of wandering) in the procedure:

sp_MSforeachtable

The name "foreachtable" itself suggests how it should work. The example below, on the pubs database, will make everything crystal clear: it first adds a GUID attribute of type uniqueidentifier to all tables, and then removes it.

USE pubs
GO

-- Add the same column to every table
EXEC sp_MSforeachtable 'ALTER TABLE ? ADD GUID uniqueidentifier NULL'

-- Remove the previously added column
EXEC sp_MSforeachtable 'ALTER TABLE ? DROP COLUMN GUID'

I would really love a "hidden" procedure for renaming all objects within a database or server, e.g. to add "new" to the names of all stored procedures. Maybe there is one :)
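In the absence of such a hidden procedure, a rough sketch of the same idea can be built from documented pieces: a cursor over sys.procedures and sp_rename. This is an illustrative approach only, not a documented bulk-rename feature, and the "new" prefix is just an example:

```sql
-- Illustrative sketch: prefix every user stored procedure name with "new".
DECLARE @old sysname, @new sysname;

DECLARE proc_cur CURSOR FOR
    SELECT name FROM sys.procedures;

OPEN proc_cur;
FETCH NEXT FROM proc_cur INTO @old;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @new = N'new' + @old;
    EXEC sp_rename @objname = @old, @newname = @new;  -- renames one procedure
    FETCH NEXT FROM proc_cur INTO @old;
END;

CLOSE proc_cur;
DEALLOCATE proc_cur;
```

Note that sp_rename does not update references inside other modules, so a bulk rename like this can break callers; that is probably one reason no supported bulk-rename procedure exists.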

END TRANSACTION

Written by Jasmin Azemović

11/07/2011 at 11:32

Posted in Database, SQL, SQL Server, TSQL
