Performance and Incremental loads
Incremental loads
All data sources grow over time, and when dealing with large volumes of data, the time it takes to reload increases along with that growth.
Incremental load is a technique whereby data is read from the source and stored on a local drive as a QVD file. Subsequent loads read from the QVD file first, then only the new (or updated) data is read from the source, and finally the result is saved again as a replacement QVD file ready for the next time.
There are a couple of techniques that can be used to accomplish an incremental load, and which one to use depends on a few factors you need to establish.
Firstly, establish where the data is coming from: is it a text file (such as a log file), a spreadsheet, or a database table?
Secondly, is data just added to the source, or is it also updated? Are records possibly deleted too?
The simplest form is when reading a text file (such as a system log), where data is only added to the end. For this, there is a simple prefix to the QlikView load instruction.
The following script will read the entire QlikView session
log.
Sessionlog:
LOAD *
FROM
[C:\ProgramData\QlikTech\QlikViewServer\Sessions_QLIKVIEWSRV.log]
(txt, utf8, embedded labels, delimiter is '\t', msq);
Every time the document is reloaded, the entire file will be read; however, we could change the script to the following:
Sessionlog:
Buffer(Incremental)
LOAD *
FROM
[C:\ProgramData\QlikTech\QlikViewServer\Sessions_QLIKVIEWSRV.log]
(txt, utf8, embedded labels, delimiter is '\t', msq);
You will notice the addition of a single line, Buffer(Incremental), which tells QlikView to create its own buffer QVD file the first time the source is read and to remember the size of the source file. Next time, it will read that QVD file first, add to it anything from the saved point onwards, and finally save the revised QVD file and pointer.
This method works well and is easy to implement, but it can only be used for simple text files (including CSV) where data is only appended.
If data is coming from a more complex source, such as a database table or a spreadsheet, a different approach is needed. First of all, it is necessary to establish whether there is a field that can be used as a marker. Ideally there is an autonumber record ID that increases with every record, a timestamp field recording when a record is written (or updated), or a date field such as the order creation date. (Care must be taken here, as performing multiple refreshes in one day could cause issues with duplicates.)
If you cannot identify a suitable field, the incremental
load approach is not possible.
In this example, it is necessary to write a lot more script to handle this approach, as there needs to be resilience for times when a total refresh is required, or for the first time the document is loaded.
Take a look at the following script.
If IsNull(FileTime('ShopSales.qvd'))
then
ShopSales:
LOAD [Shop No],
[Product Code],
Quantity,
[Sale Date]
FROM [QlikView Unlocked Data.xlsx]
(ooxml, embedded labels, table is [Shop Sale]);
ELSE
ShopSales:
LOAD * From 'ShopSales.qvd' (qvd);
Temp:
LOAD Max([Sale Date]) as MaxDate
Resident ShopSales;
LET varMaxDate=Peek('MaxDate',0,'Temp');
DROP Table Temp;
Concatenate(ShopSales)
LOAD [Shop No],
[Product Code],
Quantity,
[Sale Date]
From [QlikView Unlocked Data.xlsx]
(ooxml, embedded labels, table is [Shop Sale])
where [Sale Date]>'$(varMaxDate)';
ENDIF
Store ShopSales into ShopSales.qvd (qvd);
The first thing to do is to identify whether the holding QVD file exists. The first time this script is run, it won't exist, so a full load of the data is needed from the source. Also, if the integrity of the QVD file is ever questionable, it can simply be deleted manually, and the whole file will be refreshed the next time the document is reloaded. If the QVD file doesn't exist, the whole source file is read, just as would have happened without using an incremental load.
The real incremental load is performed after the ELSE
statement. QlikView will first read the previous QVD file, and as there is no
manipulation of data taking place, the optimized load process will be used.
Now, it is necessary to establish the current end point of the existing data; this uses the field identified earlier as the marker field, in this instance [Sale Date]. This is done by creating a temporary table with a single record holding the maximum value of the field. The content of this record is read and stored in a variable using the Peek() function. The table can be dropped when finished.
Now, we are ready to read the source file again, but this time we will add a Concatenate prefix at the beginning and a WHERE clause at the end. Concatenate tells QlikView to add the data to the existing table, and the WHERE clause restricts the load to just the new records.
Finally store the table in a local QVD file ready for the
next time.
So far, we have looked at data sources where data is only added, but there are often situations where data is modified or records are removed. This creates even more challenges for incremental loads, as existing records in the QVD have to be changed or removed.
With the previous method, it was necessary to identify a marker field to load from, but now we need to be able to identify which records have been changed too. The only really safe method here is to have a timestamp field recording when the record was created or modified. This way, it is possible to identify the records that need to be added or changed effectively and, more importantly, reliably.
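As a minimal sketch of how this insert-and-update pattern might look, assume the source has a unique key field [Record ID] and a [Modified Time] stamp (both hypothetical names, not part of the earlier example), and that a varMaxTime variable has been read from the existing QVD with Peek() in the same way as varMaxDate above. The new and changed records are loaded from the source first, then the unchanged records are appended from the QVD only where the key does not already exist:
// Sketch only: [Record ID], [Modified Time] and varMaxTime are assumed names
ShopSales:
LOAD [Record ID],
[Shop No],
[Product Code],
Quantity,
[Sale Date],
[Modified Time]
From [QlikView Unlocked Data.xlsx]
(ooxml, embedded labels, table is [Shop Sale])
where [Modified Time]>='$(varMaxTime)';
// Append the old records, skipping any key already loaded (that is, any updated record)
Concatenate(ShopSales)
LOAD * From 'ShopSales.qvd' (qvd)
where Not Exists([Record ID]);
// If records can also be deleted at the source, an inner join with the
// current list of keys removes the deleted ones from the QlikView table
Inner Join(ShopSales)
LOAD [Record ID]
From [QlikView Unlocked Data.xlsx]
(ooxml, embedded labels, table is [Shop Sale]);
Store ShopSales into ShopSales.qvd (qvd);
Note that the Where Not Exists() clause means the QVD is no longer read with the optimized load, so this variant is slower than the append-only one; that is the price of handling updates and deletes reliably.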
Although there may be more script and complexity involved in
using the incremental load techniques, the time saved while reloading a
document can be enormous, especially as the source data grows.
The main point to remember is that you must build your script to allow for the possibility that the QVD file no longer exists and a full load is needed. This is essential for the times when you suspect the integrity of the QVD file, the file is accidentally deleted, or there are major changes to the source table.
To further understand the topic, there is an excellent document available on the Qlik website that describes the process and has example scripts to use.
The link is https://community.qlik.com/docs/DOC-4597
Performance
There are other ways to speed up the load script, but
sometimes, you have to think outside the box. As every server is different,
there is no single easy answer to improving performance. Processor speed,
memory capacity, disk configuration and network speed all have an important
role to play. It is worth trying different approaches on your system to find
what works best.
Once you have loaded a table into memory, should you need to use it again, say as part of an ETL process, you might assume that the resident option would be the logical choice.
However, if the source of the table is a QVD file stored on a fast local drive, it can be a lot faster to read it again from the QVD file. This may have something to do with buffering, but we have found that, in many instances, it is worth trying.
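As a quick illustration, reusing the ShopSales table and QVD from earlier (an assumed context, purely to make the point), both versions below produce the same summary table, and on some systems the second one, reading from disk, reloads faster; it is worth timing both on yours:
// Conventional approach: aggregate from the copy already in memory
// Summary:
// LOAD [Shop No], Sum(Quantity) as TotalQty
// Resident ShopSales
// Group By [Shop No];
// Alternative worth trying: read the QVD from disk again instead
Summary:
LOAD [Shop No],
Sum(Quantity) as TotalQty
From ShopSales.qvd (qvd)
Group By [Shop No];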
If creating another copy of a table, again as part of the ETL process, you would normally create the new table using a resident load and drop the original table when finished. However, if the table is large and there's not enough free RAM in the server, it will start paging and performance will rapidly deteriorate. An alternative is to use Store to write the table into a temporary QVD file, use Drop to drop the table, and then read the temporary QVD file back from disk. This way, only one copy of the table is in memory at a time, and the time taken to save the table to disk is easily outweighed by the time that would otherwise be lost to paging.
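A minimal sketch of this pattern, assuming a large in-memory table named BigTable and an illustrative transformation (both hypothetical):
// Write the table to a temporary QVD and free its RAM before transforming
Store BigTable into 'Temp_BigTable.qvd' (qvd);
DROP Table BigTable;
// Only one copy of the data is now ever in memory at a time
Transformed:
LOAD *,
Year([Sale Date]) as SaleYear // example transformation only
From 'Temp_BigTable.qvd' (qvd);
The conventional resident approach would briefly hold both BigTable and Transformed in RAM at once; this version trades that memory peak for one extra disk write and read.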
These are just a few ways to increase performance; there are many more approaches that you may want to try.