If you have been working with Qlik for a bit I am sure you have heard of QVDs. If you have been working with Qlik a while hopefully you have heard of optimized QVD loads and non-optimized QVD loads. Either way, hopefully I can shed some of the mystery around QVDs and these two loads.
What is a QVD?
A QVD is a proprietary data file format made by Qlik which is made up of two parts:
- XML metadata header
- This contains the data around the data…. Number of records, fields, tags, etc..
- Binary data
- This is the actual data which has been compressed in a way unique to Qlik. One of Qlik’s key differentiators.
How is data stored in a Qlik application?
This is a crucial point for so many things in the Qlik world. There is a wonderful blog post by Henric Cronström who explains how Qlik stores data internally which you can read here. I advise any Qlik developer to read this, and read it again. This will drive so many decisions for optimization.
To summarize, Qlik stores data in a columnar format in the sense that each field has its own structure. There is a distinct list of values for a field which contain a mapping of the value and a reference point. This reference point (bit-stuffed pointer) is what is populating the data tables. So ultimately there is only one single value stored for each field value. All duplicate values for the rows in the table are references. This allows the data to be compressed immensely. This is also why field cardinality is the such a large driver to the Qlik application size.
QVD Data Storage
So if Qlik has this special data structure / storage mechanism, how does this affect QVDs? Well QVDs binary data is stored in a very similar format as it exists in memory.
Optimized QVD Loads
Optimized QVD Loads, in a simple explanation, taking the data from the QVD and pushing it directly into memory. There is no processing or time necessary to read or interpret the data, it is already in the form it needs to be in. The time it takes is mainly moving the file from disk to memory. It really is a feat that Qlik can load hundreds of millions of rows in seconds.
Unoptimized QVD Loads
Now knowing what an optimized QVD load does, it makes more sense of what causes an unoptimized QVD load. Once you perform a field transformation or
where condition, it becomes unoptimized; with a few exceptions. This is because Qlik can’t just directly push the data into memory anymore. No the data has to be unwrapped and modified before it is stored in memory. This is still typically faster than other data formats such as flat files, databases, APIs, etc… but it is nowhere near as fast as an optimized load. This is why the Qlik community hold it in such a high regard.
What breaks a QVD Load?
Honestly, it makes more sense to describe what you can do and still keep a QVD load optimized.
- Renaming fields.
- You can load a field twice, with a different name of course.
- Simple Where Exists.
- Not Simple:
- The field used in the
Existsclause must be in the load script.
, Dim, Amt From [Sales.qvd] (qvd) Where Exists(
- Concatenate to a table, only if it contains all of the fields of the table it is concatenating to.
- QVD can contain additional fields.
So to summarize what would break an optimization:
- Field transformation
- New field: expression, number, string, etc…
- Where & While conditions
- Concatenating a QVD to a table which has fields not in the QVD
I will admit this is a bit limiting. There are going to be use cases there the data needs to be modified in some way. My suggestion is frame out your load to as to best leverage QVD loads to optimize getting data into you application. Sometimes this will be more round about, but will likely be faster.
- 400 Million Rows
- 15 Years of Data
- 3 Company Ledgers
- Rolling Three Years
- 0L Ledger Only
Traditional Load Script (Unoptimized)
How can I leverage a QVD optimized load to get the smallest data set I need in the quickest way?
Assuming all things are quite equal:
- 400M rows / 15 years = ~27M per year. 3 years = ~100M rows
- 400M rows / 3 ledgers = ~133M rows per ledger. 1 ledger = 133 rows.
3 years of data seems to be the smaller cut of data.
I would want to leverage the simple where exist condition to filter down to the three years of data and then further filter down to a single ledger.
Even though there are more steps, it will cut down the reload time greatly. This is another example of how you should test different scenarios and strategies to determine which is faster. In our example, it may be the the 0L ledger only accounts for 10% of the data and is much smaller than three years of the total dataset.