1. Relational data layout on disk

In this assignment, we are to implement a library containing functions to store and maintain relational data on disk using the heap file data structure. We are asking you to implement and experiment two different file formats corresponding to a row store and a column store.

2. Record serialization

For simplicity, we assume that records are maps mapping attribute names to values. The attribute names are stored as part of the schema, which will not be stored as part of the record serialization.

Therefore, we can abstract records as a tuple of values.

#include <vector>
typedef const char* V;
typedef std::vector<V> Record;

You need to implement serialization of fixed length records.

/**
 * Compute the number of bytes required to serialize record
 */
int fixed_len_sizeof(Record *record);

/**
 * Serialize the record to a byte array to be stored in buf.
 */
void fixed_len_write(Record *record, void *buf);

You will also need to implement the deserialization functions.

/**
 * Deserializes `size` bytes from the buffer, `buf`, and
 * stores the record in `record`.
 */
void fixed_len_read(void *buf, int size, Record *record);

2.3. Experiments

We assume that there is only one table schema. There are 100 attributes, and each attribute is 10 bytes each. So, records in the table are fixed length.

  • Calculate the size of fixed length serialization of records in the table.

  • Use fixed_len_sizeof() to check if it agrees with your calculation.

3. Page layout

Recall that it's critial for all disk I/O to be done in units of blocks, known as pages. In this section, we experiment with storing serialized records in pages.

3.1. Storing fixed length records in pages

Use slotted directory based page layout to store fixed length records.

typedef struct {
    void *data;
    int page_size;
    int slot_size;
} Page;
/**
 * Initializes a page using the given slot size
 */
void init_fixed_len_page(Page *page, int page_size, int slot_size);

/**
 * Calculates the maximal number of records that fit in a page
 */
int fixed_len_page_capacity(Page *page);

/**
 * Calculate the free space (number of free slots) in the page
 */
int fixed_len_page_freeslots(Page *page);

/**
 * Add a record to the page
 * Returns:
 *   record slot offset if successful,
 *   -1 if unsuccessful (page full)
 */
int add_fixed_len_page(Page *page, Record *r);

/**
 * Write a record into a given slot.
 */
void write_fixed_len_page(Page *page, int slot, Record *r);

/**
 * Read a record from the page from a given slot.
 */
void read_fixed_len_page(Page *page, int slot, Record *r);

Use these functions to implement the following executables.

Load as many records from a comma separated file to fill up a page, and append the page to a file. Repeat until all the records in the CSV files are written to the page file. Your program should follow the following syntax, and produce the output containing record count, page count, and time took, similar to as follows:

$ write_fixed_len_pages <csv_file> <page_file> <page_size>

NUMBER OF RECORDS: 1000
NUMBER OF PAGES: 32
TIME: 43 milliseconds

Write another program to load the page_file, and print out all records in the page in CSV format.

$ read_fixed_len_page <page_file> <page_size>

3.2. Experiment

File generator

You can use this simple csv generator to generate test inputs.

  • Plot the performance (records / second) versus page size for write and read.

  • Discuss why page based format is superior to storing records using a CSV file.

  • Discuss the short comings of the way we organize pages.

4. Heap file

Finally, we are able to build the code to generate and maintain heap files.

Again, we assume a fixed table schema. All records have 100 attributes, and the values of each attribute are 10 bytes.

A heap file is just a paginated file. Each page is to store a series of records.

typedef struct {
    FILE *file_ptr;
    int page_size;
} Heapfile

We assume the following way to assign unique identifiers to records in the heap file:

  • Every page $p$ has an entry in the heap directory of (page_offset, freespace). The page ID of $p$ can be the index of its entry in the directory. We call this: $\mathrm{ID}(p)$.

  • Every record $r$ is stored at some slot in some page $p$. The record ID, $\mathrm{ID}(r)$ is the contenation of $\mathrm{ID}(p)$ and the slot index in $p$.

You can use the following structures to abstract page ID and record ID.

typedef int PageID;

typedef struct {
    int page_id;
    int slot;
} RecordID;

4.1. Heap file functions

We are to implement a directory based heap file in which we have directory pages (organized as a linked list), and data pages that store records.
/**
 * Initalize a heapfile to use the file and page size given.
 */
void init_heapfile(Heapfile *heapfile, int page_size, FILE *file);
/**
 * Allocate another page in the heapfile.  This grows the file by a page.
 */
PageID alloc_page(Heapfile *heapfile);
/**
 * Read a page into memory
 */
void read_page(Heapfile *heapfile, PageID pid, Page *page);
/**
 * Write a page from memory to disk
 */
void write_page(Page *page, Heapfile *heapfile, PageID pid);
The central functionality of a heap file is enumeration of records. Implement the record iterator class.
class RecordIterator {
    public:
    RecordIterator(Heapfile *heapfile);
    Record next();
    bool hasNext();
}

4.2. Heap file operations

Write the following executables to allow basic operations on the heap file. Your program must rely on page based disk I/O.

# Build heap file from CSV file
$ csv2heapfile <csv_file> <heapfile> <page_size>

# Print out all records in a heap file
$ scan <heapfile> <page_size>

# Insert all records in the CSV file to a heap file
$ insert <heapfile> <csv_file> <page_size>

# Update one attribute of a single record in the heap file given its record ID
# <attribute_id> is the index of the attribute to be updated (e.g. 0 for the first attribute, 1 for the second attribute, etc.)
# <new_value> will have the same fixed length (10 bytes)
$ update <heapfile> <record_id> <attribute_id> <new_value> <page_size>

# Delete a single record in the heap file given its record ID
$ delete <heapfile> <record_id> <page_size>

4.3. Experiment

Measure the performance of csv2heapfile, comment on how the page size affects the performance of load.

Write an executable:

# Select single attirbute with parameters <start> and <end>
# <attribute_id> is the index of the attribute to be selected (e.g. 0 for the first attribute, 1 for the second attribute, etc.)
$ select <heapfile> <attribute_id> <start> <end> <page_size>

that performs the following parametrized SQL query:

        SELECT SUBSTRING(A, 1, 5) FROM T
        WHERE A >= start AND A <= end
  • Measure the performance of the query versus page size.
  • Comment on the choice of page size and the effects of the range from start and end on the performance of the query.

5. Column Store

A column-oriented DBMS (or Column Store), stores all values of a single attribute (column) together, rather than storing all values of a single record together. Hence, the main abstract they use is of columns-of-data rather than rows-of-data. Most DBMS are row-oriented, but depending on the workload, column-oriented DBMS may provide better performance. Below is an illustration of column store (left) and row-store (right).

Consider a query that returns all values of the attribute Name. In a column store you only need to retrieve Name values from the file. In constract, in when using row-oriented storage, to find the Name value of a record, you are necessarily retrieving all attribute values in the record. Hence, to return all values of the Name attribute, you need to read the entire table. The drawback of column store is you that you need to do extra work to reconstruct the tuple. Suppose you wish to return the Name and Salary values from a table. To be able to reassemble records, a column-oriented DBMS will store the tuple id (record-id) with each value in a column.

Column-oriented storage has advantages for queries that only access some of the attributes of a table. However, insertion and deletion now may require multiple page accesses as each tuple is no longer stored on a single page

For this assignment, we will implement a simplified version of column store using your existing heap file implementation. Our implementation will have a separate heap file for each column of a table.

We will use the same fixed table schema (100 attributes, 10 bytes each). For each attribute, you need to create a separate heap file. You can name the heap file with the same name as the attribute id. You should put all attribute heap files in a single file directory. This is a simplification for this assignment to make the bookkeeping on what files are in a relation simpler. Think about the limitations of the simplification.

Tuple reconstruction: Different attributes of a tuple will be in different heap file. So, we need to reconstruct the tuple (part of the tuple) to get the result of a query. We can store the tuple-id with each field. Two attributes will have the same tuple-id if they belong to the same tuple.

5.1. Column-Oriented file operations

Implement an executable to create a column store from a CSV file:

# Build a column store from CSV file
# <colstore_name> should be a file directory to store the heap files
$ csv2colstore <csv_file> <colstore_name> <page_size>

Implement an executable:

# Select a single attribute from the column store where parameter
# <attribute_id> is the index of the attribute to be project and returned (e.g. 0 for the first attribute, 1 for the second attribute, etc.)
# the value of the attribute must be between <start> and <end> 
$ select2  <colstore_name> <attribute_id> <start> <end> <page_size>

that performs the following parametrized SQL query. Note that the selection predicate is on the same attribute that is returned by the query.

    SELECT SUBSTRING(A, 1, 5) FROM T
    WHERE A >= start AND A <= end

Implement an executable:

# Select only a single attribute from the column store where parameter
# <return_attribute_id> (B) is the index of an attribute to be projected and returned 
# <attribute_id> (A) is the index of (a possibly different) attribute whose value must be between  <start> and <end> 
$ select3  <colstore_name> <attribute_id> <return_attribute_id> <start> <end> <page_size>

that performs the following parametrized SQL query:

    SELECT SUBSTRING(B, 1, 5) FROM T
    WHERE A >= start AND A <= end

5.2. Experiment

  • Measure the performance of csv2colstore against different page sizes.

  • Compare the result with that of csv2heapfile in the previous section. Comment on the difference.

  • Compare the performance of select2 with that of select in the previous section. Comment on the difference.

  • Compare the performance of select3 with that of select and select2. Comment on the difference especially with respect to different selection ranges (different values of start and end).

6. Organize your submission

# Makefile
CC = g++ 

library.o: library.cc library.h
    $(CC) -o \$@ -c $<

csv2heapfile: csv2heapfile.cc library.o
    $(CC) -o \$@ $< library.o

scan: scan.cc library.o
    $(CC) -o \$@ $< library.o

insert: insert.cc library.o
    $(CC) -o \$@ $< library.o

update: update.cc library.o
    $(CC) -o \$@ $< library.o

delete: delete.cc library.o
    $(CC) -o \$@ $< library.o

select: select.cc library.o
    $(CC) -o \$@ $< library.o

csv2colstore: csv2colstore.cc library.o
    $(CC) -o \$@ $< library.o

select2: select2.cc library.o
    $(CC) -o \$@ $< library.o

select3: select3.cc library.o
    $(CC) -o \$@ $< library.o

7. Submission

Submit your code as .tar file to Markus