Indexing OLAP data:
Requirements on an indexing method
1. Symmetric partial match queries:
- Most OLAP queries can be expressed conceptually as partial range queries, where one or more dimensions of the cube are associated with a union of value ranges, and the data corresponding to those ranges must be retrieved efficiently.
- The extreme case is when the size of the range is one for every dimension, giving us a point query. A range can be continuous, for instance "time between Jan '94 and July '94", or discontinuous, for instance "first month of every year" and product IN (soap, shirts, shoes), as in the sketch below.
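A minimal sketch of such a symmetric partial match query, assuming a toy cube stored as a dict keyed by (product, store, time) tuples (all names hypothetical):

```python
cube = {
    ("soap",   "nyc", "1994-01"): 120,
    ("soap",   "nyc", "1995-01"): 135,
    ("shirts", "sf",  "1994-07"):  80,
}

def partial_match(cube, products=None, stores=None, times=None):
    """Return cells matching every non-None per-dimension value set."""
    def ok(value, wanted):
        return wanted is None or value in wanted
    return {k: v for k, v in cube.items()
            if ok(k[0], products) and ok(k[1], stores) and ok(k[2], times)}

# Discontinuous range: first month of each year, any store, two products.
result = partial_match(cube,
                       products={"soap", "shirts"},
                       times={"1994-01", "1995-01"})
# -> {("soap", "nyc", "1994-01"): 120, ("soap", "nyc", "1995-01"): 135}
```

All dimensions are treated the same way here; a point query is just the case where every value set has size one.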
2. Indexing at multiple levels of aggregation:
Most OLAP databases pre-compute multiple group-bys corresponding to different levels of aggregation of the base cube. For instance, group-bys could be computed at the <product-store>, <product-time>, and <time> levels for a base cube with dimensions product, store, and time. It is equally important to index this summarized data.
An issue that arises here is whether to build separate index trees for different levels of aggregation, or to add special values to the dimensions and index precomputed summaries along with the base-level data. For instance, if we index sales at the <product-year> level, then we can store total-sales at the <product> level by simply adding an additional value for the year dimension corresponding to "ALL" years and storing total-sales for each product there, as sketched below.
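A minimal sketch of the second option, assuming a toy dict-based index (ALL is a hypothetical sentinel value, distinct from any real year):

```python
ALL = "ALL"                       # sentinel for the aggregated year level

sales = {
    ("soap", 1994): 500,
    ("soap", 1995): 650,
}

# Store the <product>-level summary under year = ALL in the same structure:
sales[("soap", ALL)] = sum(v for (p, y), v in sales.items()
                           if p == "soap" and y != ALL)

# One index now answers both base-level and summary queries:
assert sales[("soap", 1994)] == 500
assert sales[("soap", ALL)] == 1150
```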
3. Multiple traversal orders:
- B-trees are commonly used in OLTP systems to retrieve data sorted on the indexed attribute; this is often cheaper than an external sort. OLAP databases, because of the large number of group-bys they perform, can also benefit from using indices to sort data fast. The challenge is in supporting this over multiple attributes instead of a single one, and for any permutation of any subset of the attributes, as the sketch below illustrates.
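A minimal sketch of why a single composite index is not enough, with a sorted list standing in for a B-tree (names hypothetical):

```python
rows = [("soap", "sf", 80), ("shirts", "nyc", 120), ("soap", "nyc", 135)]

# One sorted structure, standing in for a B-tree on (product, store):
index = sorted(rows, key=lambda r: (r[0], r[1]))

# Scanning `index` yields (product, store) order for free. Any other
# permutation, e.g. (store, product), needs a re-sort or a second
# index -- exactly the cost this requirement wants to avoid:
store_major = sorted(rows, key=lambda r: (r[1], r[0]))
```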
4. Efficient batch update:
- OLAP databases have the advantage that frequent point updates, as in OLTP systems, are uncommon. However, the update problem cannot be ignored entirely, so making batch updates efficient is essential. It is not uncommon for multinational organizations to update data as often as four times a day, since daily data from different locations of the world arrive at different times. On the other hand, these updates are clustered by region and time; this property can be exploited to localize changes and make updates faster (see the sketch below).
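A minimal sketch of exploiting that clustering, assuming hypothetical load_partition/store_partition helpers that read and write one region's stored data at a time:

```python
from itertools import groupby

def apply_batch(updates, load_partition, store_partition):
    """updates: iterable of (region, time, row) tuples."""
    batch = sorted(updates, key=lambda u: (u[0], u[1]))   # cluster order
    for region, group in groupby(batch, key=lambda u: u[0]):
        part = load_partition(region)       # one read per touched partition
        for _, time, row in group:
            part[time] = row                # apply all changes in memory
        store_partition(region, part)       # one write per touched partition
```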
5. Handle sparse data:
- Colliat in [Col96] states that typically 20% of the data in the logical OLAP cube are non-zero. However, as the OLAP model is finding newer applications, it is desirable to have an indexing method that is not tied to any fixed notion of sparsity. The ideal method should therefore scale well with increasing sparsity.
Existing methods
1. Multidimensional array-based methods:
Logically, the OLAP data cube can be viewed as a multidimensional array with the key attributes forming the axes of the array. If the data cube were dense, the ideal indexing scheme for this logical view would be the multidimensional array itself: any exact or range query on any combination of attributes could be answered by algebraically computing the right offsets and fetching the required data, as sketched below.
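A minimal sketch of the algebraic offset computation for a dense cube laid out as one flat array in row-major order, with hypothetical cardinalities:

```python
sizes = [100, 50, 24]              # cardinalities of product, store, time

def offset(p, s, t):
    """Flat position of cell (p, s, t) in a row-major layout."""
    return (p * sizes[1] + s) * sizes[2] + t

cells = [0.0] * (sizes[0] * sizes[1] * sizes[2])

# An exact-match lookup is pure arithmetic, with no index traversal:
value = cells[offset(17, 3, 11)]
```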
Consider an example of a four-dimensional cube where product and store are identified as sparse dimensions, while time and scenario belong to the dense set. A B-tree index is built on the product-store pair, and for each pair of values where data is present, a 2-dimensional array of time and scenario is stored.
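A minimal sketch of this hybrid layout, with a dict standing in for the B-tree on the sparse pair and nested lists for the dense block (cardinalities hypothetical):

```python
N_TIME, N_SCENARIO = 24, 4             # dense dimension cardinalities

btree = {}                             # (product, store) -> dense 2-D block

def get_block(product, store):
    """Fetch (or allocate) the dense time x scenario array for one pair."""
    key = (product, store)
    if key not in btree:               # allocate only for non-empty pairs
        btree[key] = [[0.0] * N_SCENARIO for _ in range(N_TIME)]
    return btree[key]

get_block("soap", "nyc")[5][1] = 42.0  # cell at time=5, scenario=1
```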
2. Bit-mapped indices and variations:
When data is sparse, a good option is not to index the multidimensional data space as a whole but to index each dimension separately, as in bit-mapped indices. This is a popular method used by several vendors, each with its own variant of the basic scheme.
Each dimension of the cube has a bit-mapped index associated with it. In the simplest form, a bit-mapped index is a B-tree where, instead of storing RIDs for each key value at the leaf, we store a bit-map. The bit-map for value "v" of attribute "A" is an array of bits, one per row of the fact table (or per non-empty cell of the data cube). A bit is "1" only at those positions where the corresponding row has value "v" for attribute A.
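A minimal sketch of building and combining such bit-maps, using Python ints as bit arrays over a toy fact table; it also previews the AND/OR/COUNT operations discussed below:

```python
rows = [("soap", "nyc"), ("shirts", "sf"), ("soap", "sf")]

def build_bitmaps(rows, dim):
    """One bit-map per distinct value of the given dimension."""
    maps = {}
    for i, row in enumerate(rows):
        maps[row[dim]] = maps.get(row[dim], 0) | (1 << i)
    return maps

product_bm = build_bitmaps(rows, 0)    # {"soap": 0b101, "shirts": 0b010}
store_bm   = build_bitmaps(rows, 1)    # {"nyc": 0b001, "sf": 0b110}

# OR within a dimension's range, AND across dimensions, COUNT via popcount:
match = product_bm["soap"] & (store_bm["nyc"] | store_bm["sf"])
count = bin(match).count("1")          # 2 qualifying rows
```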
The major advantages of this method are:
(1) For low-cardinality data, bit-maps are both space- and retrieval-efficient. Bit operations like AND/OR/NOT/COUNT are more efficient than the same operations on RID lists.
(2) Access to data is clustered since the bit-map order corresponds to the data storage order.
(3) All dimensions are treated symmetrically and sparse data can be handled the same way as dense data.
(4) If required, data can also be retrieved in any arbitrary sort order of the dimensions by traversing the bit-maps in a certain order. However, using the indices to retrieve data in a particular order can result in a loss of the clustered data access property.
The major disadvantages of bit-mapped indices are:
(1) ORing bit-maps for range queries can be expensive. The number of AND operations is limited to the number of dimensions and is therefore small in most cases. However, the number of OR operations can be large for many queries, since each value in a range of a dimension incurs an OR operation.
(2) The increased space overhead of storing the bit-maps, especially for high-cardinality data.
(3) Batch updates can also be expensive, since every bit-map index has to be modified for even a single new row insertion.
3. Compression:
- Bit-maps are often compressed to reduce space overhead. But this means either incurring the overhead of decompression during retrieval, or relying on methods for doing "AND" and "OR" operations directly on compressed bit-maps. For simple compression schemes like run-length encoding, it is easy to design AND/OR algorithms that work on compressed data (see the sketch below). However, it is necessary to keep the cost of these operations low, since one of the main reasons for using bit-mapped indices is that they enable fast bit operations.
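A minimal sketch of AND directly on run-length encoded bit-maps, assuming both inputs encode the same number of bits as lists of (bit, run_length) pairs:

```python
def rle_and(a, b):
    """AND two run-length encoded bit-maps without decompressing fully."""
    out, i, j = [], 0, 0
    (abit, alen), (bbit, blen) = a[0], b[0]
    while True:
        bit, length = abit & bbit, min(alen, blen)
        if out and out[-1][0] == bit:
            out[-1] = (bit, out[-1][1] + length)   # merge adjacent equal runs
        else:
            out.append((bit, length))
        alen -= length; blen -= length
        if alen == 0:                              # advance to a's next run
            i += 1
            if i == len(a): break
            abit, alen = a[i]
        if blen == 0:                              # advance to b's next run
            j += 1
            if j == len(b): break
            bbit, blen = b[j]
    return out

# 1110011 AND 1011101 -> 1010001, i.e. [(1,1), (0,1), (1,1), (0,3), (1,1)]
print(rle_and([(1, 3), (0, 2), (1, 2)],
              [(1, 1), (0, 1), (1, 3), (0, 1), (1, 1)]))
```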
4. Hybrid methods:
Since bit-maps are not appropriate for high-cardinality data, some products [Ede95] follow a hybrid approach: a plain B-tree with RID lists is used when the list of qualifying RIDs per entry is small, and a bit-mapped index otherwise, as sketched below.
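A minimal sketch of one such hybrid index entry, with a hypothetical cutoff that a real system would tune:

```python
THRESHOLD = 32                      # hypothetical cutoff, tuned per system

class HybridEntry:
    """Index entry: plain RID list while small, bitmap after it grows."""
    def __init__(self):
        self.rids, self.bitmap = [], None

    def add(self, rid):
        if self.bitmap is not None:
            self.bitmap |= 1 << rid
            return
        self.rids.append(rid)
        if len(self.rids) > THRESHOLD:          # convert list -> bitmap
            self.bitmap = 0
            for r in self.rids:
                self.bitmap |= 1 << r
            self.rids = None
```

Short lists stay cheap to store and scan, while long lists switch to the representation whose cost no longer grows with the number of qualifying rows.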