Performance and Incremental loads

Incremental loads


All data sources grow over time, and when dealing with large volumes of data, the time it takes to reload increases along with that growth.


Incremental load is a technique whereby data is read from source and stored on a local drive as a QVD file. Subsequent loads are first done from the QVD file, then only the new (or updated) data is read from the source and finally saved again as a replacement QVD file ready for the next time.  

There are a couple of techniques that can be used to accomplish an incremental load, and which one to use depends on a few factors you need to establish.

Firstly, establish where the data is coming from: is it a text file (such as a log file), a spreadsheet, or a database table?

Secondly, is data only ever added to the source, or is existing data updated? Are records possibly deleted too?

The simplest form is when reading a text file (such as a system log) where data is only added to the end. For this, there is a simple prefix to the QlikView LOAD statement.


The following script will read the entire QlikView session log.

Sessionlog:
LOAD *
FROM
[C:\ProgramData\QlikTech\QlikViewServer\Sessions_QLIKVIEWSRV.log]
(
txt, utf8, embedded labels, delimiter is '\t', msq);

Every time the document is reloaded, the entire file will be read. However, we could change it to the following.

Sessionlog:
Buffer(Incremental)
LOAD *
FROM
[C:\ProgramData\QlikTech\QlikViewServer\Sessions_QLIKVIEWSRV.log]
(
txt, utf8, embedded labels, delimiter is '\t', msq);

You will notice the addition of a single line, Buffer(Incremental), which tells QlikView to create its own QVD file the first time the source is read and to remember the size of the source file. The next time, it will read the QVD file first, append anything from that saved point onwards, and finally save the revised QVD file and pointer.

This method works well and is easy to implement, but it can only be used for simple text files (including CSV) where data is only appended.

If data is coming from a more complex source, such as a database table or spreadsheet, a different approach is needed. First of all, it is necessary to establish whether there is a field that can be used as a marker. Ideally there is an autonumber record ID that increases with every record, a timestamp field recording when a record was written (or updated), or a date field such as the order creation date. (Care must be taken here, as performing multiple refreshes in one day could cause issues with duplicates.)
If you cannot identify a suitable field, the incremental load approach is not possible.

In this example, it is necessary to write a lot more script to handle this approach, as there needs to be resilience for times when a total refresh is required, or for the first time the document is loaded.

Take a look at the following script.

If IsNull(FileTime('ShopSales.qvd')) then

    ShopSales:
    LOAD [Shop No],
         [Product Code],
         Quantity,
         [Sale Date]
    FROM [QlikView Unlocked Data.xlsx]
    (ooxml, embedded labels, table is [Shop Sale]);

ELSE

    ShopSales:
    LOAD * FROM [ShopSales.qvd] (qvd);

    Temp:
    LOAD Max([Sale Date]) as MaxDate
    Resident ShopSales;

    LET varMaxDate = Peek('MaxDate', 0, 'Temp');

    DROP Table Temp;

    Concatenate(ShopSales)
    LOAD [Shop No],
         [Product Code],
         Quantity,
         [Sale Date]
    FROM [QlikView Unlocked Data.xlsx]
    (ooxml, embedded labels, table is [Shop Sale])
    WHERE [Sale Date] > '$(varMaxDate)';

ENDIF

Store ShopSales into ShopSales.qvd (qvd);
 
The first thing to do is to identify whether the holding QVD file exists. The first time this script is run, it won't exist, so a full load of the data is needed from the source. Also, if the integrity of the QVD file is questionable, it can be manually deleted and the whole file will be refreshed the next time the document is reloaded. If the QVD file doesn't exist, the whole source file is read, just as it would have been without using incremental load.

The real incremental load is performed after the ELSE statement. QlikView will first read the previous QVD file, and as there is no manipulation of data taking place, the optimized load process will be used.

Now, it is necessary to establish the current end point of the existing data. This uses the field identified earlier as the marker field; in this instance, let's use [Sale Date]. This is done by creating a temporary table with a single record containing the maximum value of the field. The content of this record is read and stored into a variable using the Peek() function, and the temporary table can then be dropped.

Now, we are ready to read the source file again, but this time we will add Concatenate() at the beginning and a WHERE clause at the end. Concatenate tells QlikView to add the data to the existing table, and the WHERE clause restricts the load to just the new records.

Finally, store the table in a local QVD file ready for the next time.

So far, we have looked at data sources where data is only added, but there are often situations where data is modified or records are removed. This creates even more challenges for incremental loads, as existing records in the QVD have to be changed or removed.

With the previous method, it was necessary to identify a marker field to load from, but now we also need to be able to identify which records have been changed. The only really safe method here is to have a timestamp field recording when the record was created or modified. This way, it is possible to identify the records that need to be added or changed effectively and, more importantly, reliably.
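A common pattern for handling inserts, updates, and deletes combines the timestamp marker with the Exists() function and an inner join. The following is a sketch only, assuming a hypothetical Orders database table with a unique key [Order ID] and a [Modified Date] timestamp; all table, field, and file names are illustrative.

// 1. Read the new and updated rows from the source.
Orders:
SQL SELECT [Order ID], Amount, [Modified Date]
FROM Orders
WHERE [Modified Date] >= '$(varMaxDate)';

// 2. Append the unchanged rows from the previous QVD, skipping
//    any key already loaded in step 1 (those rows were updated).
Concatenate(Orders)
LOAD [Order ID], Amount, [Modified Date]
FROM [Orders.qvd] (qvd)
WHERE NOT Exists([Order ID]);

// 3. Handle deletions: keep only keys that still exist in the
//    source table.
Inner Join(Orders)
SQL SELECT [Order ID] FROM Orders;

// 4. Save the refreshed QVD ready for the next reload.
STORE Orders INTO [Orders.qvd] (qvd);

Note that the WHERE NOT Exists() clause in step 2 means the load from the QVD is no longer optimized, which is part of the extra cost of supporting updates and deletes.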

Although there may be more script and complexity involved in using the incremental load techniques, the time saved while reloading a document can be enormous, especially as the source data grows.

The main point to remember is that you must build your script to handle the possibility that the QVD file no longer exists and a full load is needed. This is essential for the times when you suspect the integrity of the QVD file, it has been accidentally deleted, or there are major changes to the source table.

To further understand the topic, there is an excellent document available on the Qlik website that describes the process and has example scripts to use.

The link is https://community.qlik.com/docs/DOC-4597

Performance


There are many ways to speed up the load script, but sometimes you have to think outside the box. As every server is different, there is no single easy answer to improving performance. Processor speed, memory capacity, disk configuration, and network speed all have an important role to play. It is worth trying different approaches on your system to find what works best.


Once you have loaded a table into memory, should you need to use it again, say as part of an ETL process, you might assume that using the Resident option would be the logical choice.

However, if the source of the table is a QVD file stored on a fast local drive, it can be a lot faster to read it again from the QVD file. This may have something to do with buffering, but we have found that, in many instances, it is worth trying.
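As a sketch of the two alternatives (table and file names are illustrative), both of the following produce a second copy of the data, and it is worth timing each on your own hardware:

// Option A: reuse the in-memory table with a Resident load.
SalesCopy:
LOAD *
Resident ShopSales;

// Option B: read the QVD again from the fast local drive.
// With no transformations, this remains an optimized QVD load
// and can be quicker than the Resident load.
SalesCopy:
LOAD *
FROM [ShopSales.qvd] (qvd);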

If creating another copy of a table, again as part of the ETL process, you would normally create a new table using a Resident load and drop the original table when finished. However, if the table is large and there is not enough free RAM in the server, it will start paging and performance will rapidly deteriorate. An alternative is to use STORE to write the table into a temporary QVD file, DROP the table, and then read the temporary QVD file back from disk. This way, only one table is in memory at a time, and the time taken to save the table to disk can easily be less than the time that would otherwise be lost to paging.
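A minimal sketch of this store-drop-reload pattern, assuming an existing large in-memory table called BigTable (all names and the transformation are illustrative):

// Write the large table to a temporary QVD on a fast local drive.
STORE BigTable INTO [Temp_BigTable.qvd] (qvd);

// Free the RAM before creating the transformed copy.
DROP Table BigTable;

// Read it back, applying the transformation, so only one copy
// of the data is ever in memory at a time.
BigTable:
LOAD *,
     Quantity * Price as LineTotal   // example transformation
FROM [Temp_BigTable.qvd] (qvd);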

These are just a few ways to increase performance; there are many more approaches that are worth trying.
