Basic index structures

CSC 443

1. Motivation for indexing

A heap file supports sequential scans of records. In theory, this is sufficient to implement all query operators in SQL. However, the resulting efficiency would be impractically poor.

In this article, we discuss a number of simple techniques to improve the efficiency of record scans and record searches. The common theme is:

Create secondary data storage to minimize disk I/O for record scan and search.

2. Sequential file

A sequential file is a heap file with the properties:

  1. all records on a page are sorted by some key attributes $K$: for any page $P$ and any two consecutive records $r_i, r_{i+1}\in P$, $$ r_i[K]\leq r_{i+1}[K] $$
  2. for all pages $P_i$ and $P_{i+1}$ in the heap file, we have that all records in $P_i$ are less than records in $P_{i+1}$ by the key $K$.

$$ \forall P_i,\ \forall r\in P_i,\ \forall r'\in P_{i+1},\ r \leq r' $$

Thus, a sequential file is a heap file with all records sorted (by the key attributes $K$) over the entire file.

This allows one to apply binary search when looking up record(s) by the sorted attributes $K$.

Recall the design of the heap file: to look for a record $r$ such that $r[K] = \vec{v}$, we load the page pointers from the heap page directory and perform binary search by fetching data pages.

The total number of disk I/O required in the worst case is:

$$ \log_2(N_\mathrm{data\ pages}) + (N_\mathrm{directory\ pages})$$
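The lookup described above can be sketched as follows. This is a minimal sketch, assuming a hypothetical `read_page(addr)` function that fetches one page (here, simply a sorted list of keys) per disk I/O, and a page directory already loaded into memory as a list of page addresses:

```python
def find_record(page_addrs, key, read_page):
    """Binary search over the pages of a sequential file.

    Each read_page call costs one disk I/O, so at most
    log2(len(page_addrs)) pages are fetched.
    """
    lo, hi = 0, len(page_addrs) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        page = read_page(page_addrs[mid])  # one disk I/O per probe
        if key < page[0]:                  # key, if present, is on an earlier page
            hi = mid - 1
        elif key > page[-1]:               # key, if present, is on a later page
            lo = mid + 1
        else:                              # this page's key range covers `key`
            return key if key in page else None
    return None
```

Note that each loop iteration costs exactly one page read, which is where the $\log_2(N_\mathrm{data\ pages})$ term comes from; scanning the directory pages to obtain `page_addrs` accounts for the remaining term.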

A sequential file does not, however, address the inefficiency of record scans.

3. Dense indexes

Dense indexes are secondary storage which stores redundant data in order to improve the efficiency of query processing.

Consider the case that a data file already exists.

Let $K = (k_1, k_2, \dots, k_n)$ be a key. A dense index is a sequential file whose records only contain the values over $K$ and a pointer to the address of the data record in the data file:

For each record $r\in\mathrm{Data\ File}$, we have: $$ r_\mathrm{index} = (r[k_1], r[k_2], \dots, r[k_n], \mathrm{address}(r\ \mathrm{in\ Data\ File})) $$ in the index file.

The motivation is that index records can be significantly smaller than the data records, so each index page holds many more records. A sequential scan through the index file therefore requires far fewer disk I/Os, since fewer pages must be retrieved.
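Building a dense index can be sketched as below. The record layout (a dict keyed by attribute name) and the iteration over `(address, record)` pairs are illustrative assumptions, not a prescribed format:

```python
def build_dense_index(data_file, K):
    """One index record per data record: the key values over K,
    followed by the record's address in the data file."""
    index = [tuple(record[k] for k in K) + (addr,)
             for addr, record in data_file]
    index.sort()  # a dense index is itself a sequential file, sorted by K
    return index
```

Because each index record carries only the key values and an address, far more of them fit on a page than full data records would.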

4. Sparse indexes

To combine the strengths of sequential files and dense indexes, sparse indexes support binary search for fast record lookup by storing their records sorted by some key $K$, while further reducing the disk I/O required.

To apply sparse indexes, the data file must be a sequential file, namely data records are sorted by a key $K$.

The sparse index only stores the key values and address for the first record of each data page in the data file.

For each data page $P\in\mathrm{Data\ File}$, let $r$ be the first record in $P$, we have an index record in the sparse index: $$ (r[k_1], r[k_2], \dots, r[k_n], \mathrm{address}(P)) $$

This reduces the number of index pages in the sparse index. Searching for records with $r[K]=\vec v$ using a sparse index, we need to:

  1. search in the sparse index for the record $$r^* = \mathrm{argmax}_{r:\ r[K]\leq \vec{v}}\ r[K],$$ i.e. the last index record whose key does not exceed $\vec{v}$. Binary search can be used.
  2. search in the page pointed by $r^*$ for $\{r: r[K] = \vec{v}\}$.

Assumption: for simplicity, we assume that we only need to retrieve the first occurrence of a record $r$ where $r[K] = \vec{v}$.
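The two-step lookup can be sketched as follows. The sketch assumes the sparse index is an in-memory sorted list of `(first_key, page_address)` pairs, and that a hypothetical `read_page(addr)` returns a data page as a sorted list of `(key, record)` pairs:

```python
import bisect

def sparse_lookup(index, key, read_page):
    """Step 1: find r* = the last index entry with first_key <= key.
    Step 2: scan the data page it points to for the first matching record."""
    # bisect_right locates the insertion point past all entries whose
    # first_key <= key; the entry just before it is r*.
    pos = bisect.bisect_right(index, (key, float("inf"))) - 1
    if pos < 0:                      # key precedes every indexed page
        return None
    page = read_page(index[pos][1])  # one disk I/O for the data page
    for k, record in page:
        if k == key:
            return record            # first occurrence only, per the assumption
    return None
```

Step 1 touches only index pages (of which there is one per data page), and step 2 costs a single additional data-page read.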

5. Multilevel sparse indexes: tree indexes

Sparse indexes require the data file to be sorted (i.e. a sequential file), and supports record search with: $$ \log_2(N_\mathrm{index\ pages}) + 1 $$ number of disk I/Os.

  • $\log_2(N_\mathrm{index\ pages})$ page reads are needed to locate the index page.
  • 1 page read to load the data page.

It turns out that we can improve search even further by observing that the sparse index file is itself a sequential file. So it can in turn be indexed by yet another sparse index.

Let $D$ be the data file, $I_1$ the first-level sparse index, and $I_2$ the second-level sparse index. Let $N_D$, $N_{I_1}$, and $N_{I_2}$ be their respective numbers of pages.

It is the case that: $$N_D > N_{I1} > N_{I2}$$

The number of disk I/Os required for record search is:

$$ \log_2(N_{I2}) + 2 $$

  • $\log_2(N_{I2})$ page reads are used to locate index page in $I_2$.
  • 1 page read to load the index page in $I_1$
  • 1 page read to load the data page.
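This generalizes to $k$ levels, with a search cost of $\log_2(N_{I_k}) + k$ disk I/Os. To see the savings concretely, here is a small cost sketch under assumed, purely illustrative sizes: a data file of 1,000,000 pages and a fan-out of 100 index records per index page:

```python
from math import ceil, log2

def index_pages(n_data_pages, fanout, levels):
    """Pages in the top-level index after `levels` sparse indexes are stacked."""
    n = n_data_pages
    for _ in range(levels):
        n = ceil(n / fanout)  # one index record per page of the level below
    return n

def search_ios(n_data_pages, fanout, levels):
    """Binary search over the top-level index, plus one page read per
    lower index level, plus one read for the data page itself."""
    return ceil(log2(index_pages(n_data_pages, fanout, levels))) + levels
```

Under these assumed sizes, one index level gives $\lceil\log_2(10{,}000)\rceil + 1 = 15$ I/Os per search, while a second level brings it down to $\lceil\log_2(100)\rceil + 2 = 9$.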

6. Indexed sequential access method (ISAM)

6.1. Limitation of multilevel sparse index

Multilevel sparse indexing can be applied successively: the number of index pages shrinks at each level, until the top level consists of a single index page.

While the multilevel sparse index structure is efficient for record access, it unfortunately cannot easily support record inserts and updates. Recall that sparse indexes can only index a sorted sequential file, so any record insert or update (on the sorting attributes) must maintain the ordering. A heap file, by contrast, only supports efficient record appends, without any concern for the ordering of the records.

6.2. ISAM

Indexed Sequential Access Method (ISAM) is an indexing technique based on multilevel sparse index, but it overcomes the limitation of multilevel sparse index, and supports fast record insert and update.

The modification is that each data page in the data file can have any number of overflow pages. When inserting a record $r$,

  1. locate data page $P_i$ such that $P_i[0][K]\leq r[K] < P_{i+1}[0][K]$.
  2. if $P_i$ has free space, insert $r$ into $P_i$.
  3. if there is an overflow page, $Q$, of $P_i$ that has free space, insert $r$ into $Q$.
  4. if all the overflow pages are also full, create a new overflow page to store $r$.

Example: suppose the record (25, ...) is inserted after the index is built. Since there is no free space in the page(s) covering keys between 20 and 30, ISAM must create a new page, which is appended to the appropriate linked list. The insertion of (25, ...) thus creates an overflow page.
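The four insertion steps above can be sketched as follows. The page layout (a dict with a sorted `records` list and an `overflow` pointer) and the capacity of four records per page are illustrative assumptions:

```python
import bisect

PAGE_CAPACITY = 4  # assumed, purely illustrative

def new_page():
    return {"records": [], "overflow": None}

def isam_insert(data_pages, first_keys, key, value):
    """Insert into the primary page whose key range covers `key`,
    spilling into its overflow chain when the page is full."""
    # Step 1: locate P_i with P_i[0][K] <= key < P_{i+1}[0][K].
    i = max(bisect.bisect_right(first_keys, key) - 1, 0)
    page = data_pages[i]
    # Steps 2-4: the first page (or overflow page) with free space wins;
    # if the whole chain is full, grow it by one page.
    while len(page["records"]) >= PAGE_CAPACITY:
        if page["overflow"] is None:
            page["overflow"] = new_page()  # step 4: create a new overflow page
        page = page["overflow"]
    bisect.insort(page["records"], (key, value))
```

Note that records landing in an overflow chain sit outside the global sort order; this is exactly what degrades worst-case retrieval, as discussed next.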

6.3. Limitation of ISAM

ISAM relaxes the sortedness of records: overflow pages are not merged into the sort order. Consequently, the worst-case cost of record retrieval becomes

$$ \log_2(N_\mathrm{index\ pages}) + 1 + \max_P\left(N_\mathrm{overflow\ pages}(P)\right) $$

since, after locating and reading the primary data page, the entire overflow chain of that page may have to be scanned.

The root cause of ISAM's limitation is that the multilevel sparse indexes remain static in the presence of data updates.

We will next discuss an index structure, the B+ tree, which addresses the performance degradation of ISAM by means of dynamic tree balancing.