Columnstore indexes - Data loading

20-08-2019

This post covers options and recommendations for loading data into a columnstore index using the standard SQL bulk loading and trickle insert methods. Loading data into a columnstore index is an essential part of any data warehousing process because it moves data into the index in preparation for analytics.

What is bulk loading?
Bulk loading refers to the way large numbers of rows are added to a data store. It is the most performant way to move data into a columnstore index because it operates on batches of rows. Bulk loading fills rowgroups to maximum capacity and compresses them directly into the columnstore. Only rows at the end of a load that don't meet the minimum of 102,400 rows per rowgroup go to the deltastore.

To perform a bulk load, you can use the bcp utility, SQL Server Integration Services (SSIS), or an INSERT ... SELECT from a staging table.
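As a minimal sketch of the staging-table approach, the example below assumes a hypothetical dbo.FactSales fact table with a clustered columnstore index and a rowstore staging table dbo.FactSales_Staging that already holds the incoming rows (all names are illustrative):

```sql
-- Hypothetical target: a fact table with a clustered columnstore index
-- (inline index syntax requires SQL Server 2016 or later).
CREATE TABLE dbo.FactSales (
    SaleID    BIGINT        NOT NULL,
    SaleDate  DATE          NOT NULL,
    ProductID INT           NOT NULL,
    Quantity  INT           NOT NULL,
    Amount    DECIMAL(18,2) NOT NULL,
    INDEX cci_FactSales CLUSTERED COLUMNSTORE
);

-- Bulk load by selecting rows out of a staging table.
INSERT INTO dbo.FactSales (SaleID, SaleDate, ProductID, Quantity, Amount)
SELECT SaleID, SaleDate, ProductID, Quantity, Amount
FROM dbo.FactSales_Staging;
```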

A bulk load has these characteristics:

  • Does not pre-sort the data. Data is inserted into rowgroups in the order it is received.
  • If the batch size is >= 102,400 rows, the rows are loaded directly into compressed rowgroups. It is recommended that you choose a batch size >= 102,400 for efficient bulk import, because it avoids moving rows into delta rowgroups before a background thread, the tuple mover (TM), eventually moves them into compressed rowgroups (see the example after this list).
  • If the batch size is < 102,400 rows, or if the remaining rows at the end of a load are < 102,400, the rows are loaded into delta rowgroups.
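To make the batch-size guidance concrete, here is a hedged BULK INSERT sketch; the file path, terminators, and table name are placeholders:

```sql
-- BATCHSIZE >= 102,400 lets each batch compress directly into
-- columnstore rowgroups instead of landing in the deltastore first.
BULK INSERT dbo.FactSales
FROM 'C:\loads\factsales_01.csv'   -- placeholder path
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    BATCHSIZE       = 102400
);
```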

Bulk loading has these built-in performance optimizations:

Parallel loads: You can run multiple concurrent bulk loads (bcp or BULK INSERT), each loading a separate data file. Unlike rowstore bulk loads into SQL Server, you don't need to specify TABLOCK, because each bulk-import thread loads data exclusively into a separate rowgroup (compressed or delta) and takes an exclusive lock on it.
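A sketch of the parallel pattern, assuming two source files and two separate sessions; note that no TABLOCK hint is used, because each load writes into its own rowgroups:

```sql
-- Session 1 (one connection or bcp process):
BULK INSERT dbo.FactSales
FROM 'C:\loads\factsales_01.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', BATCHSIZE = 102400);

-- Session 2, running concurrently against a different file:
BULK INSERT dbo.FactSales
FROM 'C:\loads\factsales_02.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', BATCHSIZE = 102400);
```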

Reduced logging: Data that is loaded directly into compressed rowgroups leads to a significant reduction in the size of the transaction log. For example, if the data is compressed 10x, the corresponding transaction log is roughly 10x smaller, without requiring TABLOCK or the bulk-logged/simple recovery model. Any data that goes to a delta rowgroup is fully logged; this includes any batch size of fewer than 102,400 rows. Best practice is to use a batch size >= 102,400. Since no TABLOCK is required, you can load the data in parallel.
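One way to check where loaded rows actually landed (compressed rowgroups versus the deltastore) is the rowgroup physical-stats DMV; a sketch, assuming SQL Server 2016 or later and the hypothetical table from above:

```sql
-- COMPRESSED rowgroups took the reduced-logging path;
-- OPEN or CLOSED rowgroups are delta rowgroups and were fully logged.
SELECT OBJECT_NAME(object_id) AS table_name,
       row_group_id,
       state_desc,
       total_rows
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID('dbo.FactSales')
ORDER BY row_group_id;
```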

Minimal logging: You can get a further reduction in logging if you follow the prerequisites for minimal logging. However, unlike loading data into a rowstore, TABLOCK leads to an X lock on the table rather than a BU (bulk update) lock, so the data load cannot be done in parallel. For more information on locking, see [Locking and row versioning](../sql-server-transaction-locking-and-row-versioning-guide.md).
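As a sketch of that trade-off, the TABLOCK variant of the staging-table load looks like this; it can qualify for minimal logging (given the usual prerequisites) but serializes other loads against the table:

```sql
-- TABLOCK can enable minimal logging under the standard prerequisites
-- (for example, the SIMPLE or BULK_LOGGED recovery model), but on a
-- columnstore table it takes an exclusive (X) table lock, so concurrent
-- loads into dbo.FactSales block until this statement completes.
INSERT INTO dbo.FactSales WITH (TABLOCK)
SELECT SaleID, SaleDate, ProductID, Quantity, Amount
FROM dbo.FactSales_Staging;
```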

Locking optimization: An X lock on a rowgroup is automatically acquired when loading data into a compressed rowgroup. However, when bulk loading into a delta rowgroup, an X lock is acquired on the rowgroup, but SQL Server still locks at the PAGE/EXTENT level because the X rowgroup lock is not part of the locking hierarchy.
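If you want to observe which locks a load actually takes, you can query sys.dm_tran_locks from within the loading session; a minimal sketch:

```sql
-- Run inside the loading session/transaction to list the locked
-- resource types (e.g., ROWGROUP, PAGE, EXTENT, OBJECT) and their modes.
SELECT resource_type,
       request_mode,
       request_status
FROM sys.dm_tran_locks
WHERE request_session_id = @@SPID;
```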

If you have a nonclustered B-tree index on a columnstore index, there is no locking or logging optimization for the index itself, but the optimizations on the clustered columnstore index described above still apply.

Sam Analytiks motivates, educates, and helps organisations make the most of their data as a non-profit partner. If you or any of your team members need help, our consultants will be glad to assist.

Call us at +48-729473572 or email us at sam@samanalytiks.com
#data #datavisualisation #datamodelling #datamart #powerbi #sql #excel #powerquery #azure #azureanalysis 

