1. Disk access characteristics

In this assignment, we investigate the data access characteristics of secondary storage devices. By the end, you will see the need for block-based data transfer between main memory and the hard drive.

Refer to the hints for some useful information on C.

2. Sequential write to file

2.1. Random file generation

Implement a function:

/**
 * Populate a random array (which is already
 * allocated with enough memory to hold `bytes` bytes).
 */
void random_array(char *array, long bytes);

random_array() populates the array with random characters from A to Z, inclusive.
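
A minimal sketch of one way to implement it, reusing the rand() hint from Section 5 (a sketch, not a required implementation):

#include <stdlib.h>

void random_array(char *array, long bytes) {
    /* Fill each byte with a random letter between 'A' and 'Z'. */
    for (long i = 0; i < bytes; i++) {
        array[i] = 'A' + (rand() % 26);
    }
}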

Now use this function to implement the command create_random_file with the syntax below. Your program should allocate a fixed buffer, char buffer[block_size], then repeatedly fill buffer with random content and write buffer to disk. A reasonable block_size is 1MB.

$ create_random_file <filename> <total bytes> <block size>

Your program will need to report the time it took to write the random bytes to the file using the specified block size.
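
The core write loop might look like the following sketch. It assumes argument parsing and error checking are handled elsewhere, write_random_file is a hypothetical helper name, and the timing follows the ftime() hint from Section 5:

#include <stdio.h>
#include <stdlib.h>
#include <sys/timeb.h>

void random_array(char *array, long bytes);   /* from library.h */

/* Sketch: write total_bytes of random letters to filename, block_size bytes
 * at a time, and return the elapsed time in milliseconds. */
long write_random_file(const char *filename, long total_bytes, long block_size) {
    char *buffer = malloc(block_size);
    FILE *fp = fopen(filename, "w");

    struct timeb start, end;
    ftime(&start);

    long remaining = total_bytes;
    while (remaining > 0) {
        long chunk = remaining < block_size ? remaining : block_size;
        random_array(buffer, chunk);       /* fill the buffer with random letters */
        fwrite(buffer, 1, chunk, fp);      /* write one block */
        remaining -= chunk;
    }
    fflush(fp);

    ftime(&end);
    fclose(fp);
    free(buffer);

    return (end.time - start.time) * 1000 + (end.millitm - start.millitm);
}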

2.2. Experiments

  • Write a script to experiment with different block sizes, and record the respective write data rate in bytes/second. You can use any scripting language (e.g. Python, Bash). Your scripts MUST run on CDF. Your script has to measure the rate for 10 different block sizes in the range of 100B to 3MB.

  • You have to prepare a very short report with the following three sections:

      • Plot the observed write data rate versus block size. Provide a simple explanation of the observation.

      • Discuss the existence of an optimal block size for write.

      • Try it on different machines and different storage media (hard drive, solid state drive, USB storage), picking at least two different media, and compare the plots.

3. Sequential read from file

In this section, you are to experiment with the performance of sequential scan with different read block sizes.

3.1. Computing histogram using block read

You are to implement a function that scans through the file in blocks and computes the distribution of the different letters, i.e., counts the occurrences of each letter `A` to `Z`. Your function returns a status code.
/**
 * file_ptr : the file pointer, ready to be read from.
 * hist: an array to hold 26 long integers.  hist[0] is the
 *       number of 'A', and hist[1] is the number of 'B', etc.
 * block_size: the buffer size to be used.
 * milliseconds: time it took to complete the file scan
 * total_bytes_read: the amount of data read, in bytes
 *
 * returns: -1 if there is an error.
 */
int get_histogram(
    FILE *file_ptr, 
    long hist[], 
    int block_size, 
    long *milliseconds, 
    long *total_bytes_read);
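
A minimal sketch of how the body of get_histogram might be structured, following the fread() and ftime() hints from Section 5 (error handling kept to a bare minimum):

#include <stdio.h>
#include <stdlib.h>
#include <sys/timeb.h>

int get_histogram(FILE *file_ptr, long hist[], int block_size,
                  long *milliseconds, long *total_bytes_read) {
    if (file_ptr == NULL || block_size <= 0)
        return -1;

    char *buf = malloc(block_size);
    if (buf == NULL)
        return -1;

    /* Initialize the counters before scanning. */
    for (int i = 0; i < 26; i++)
        hist[i] = 0;
    *total_bytes_read = 0;

    struct timeb start, end;
    ftime(&start);

    size_t n;
    while ((n = fread(buf, 1, block_size, file_ptr)) > 0) {
        for (size_t i = 0; i < n; i++) {
            if (buf[i] >= 'A' && buf[i] <= 'Z')
                hist[buf[i] - 'A']++;      /* count this letter */
        }
        *total_bytes_read += n;
    }

    ftime(&end);
    *milliseconds = (end.time - start.time) * 1000 + (end.millitm - start.millitm);

    free(buf);
    return 0;
}
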
Here is a sample of how your function can be used:
long hist[26];
long milliseconds;
long filelen;
FILE *file_ptr = fopen("mybigfile.txt", "r");

/**
 * Compute the histogram using 2K buffers
 */
int ret = get_histogram( file_ptr, 
                         hist, 
                         2 * 1024,
                         &milliseconds,
                         &filelen);
assert(ret >= 0);

printf("Computed the histogram in %d ms.\n", milliseconds);
for(int i=0; i < 26; i++) {
    printf("%c : %d\n", 'A' + i, hist[i]);
}
printf("Data rate: %f Bps\n", (double)filelen/milliseconds * 1000);
Create an executable with the arguments:
$ get_histogram <filename> <blocksize>
Your program is to display the histogram, the block size used, the total number of bytes read from the file, and the time it took in milliseconds. Here is a sample printout of the program:
A 43948
B 48299
C 43424
D 23852
...
Z 48603
BLOCK SIZE 1048576 bytes
TOTAL BYTES 104857600 bytes
TIME 72 milliseconds

3.2. Experiment

  • Write a script to experiment with different block sizes, and record the read data rate in bytes per second. You can use any scripting language (e.g. Python, Bash). Your scripts MUST run on CDF machines. Your script has to measure the rate for 10 different block sizes in the range of 100B to 3MB.

  • You have to prepare a short report with the following three sections:

      • Plot the observed read data rate versus block size. Explain the curve.

      • Discuss the existence of an optimal block size for read. Compare this with the case for write.

      • Run it on different machines and different storage media, picking at least two media, and compare the plots.

4. Organize your code

Organize your source into a small library (library.cc and library.h) plus the two executables, built with a Makefile along the lines of:

CC = g++
library.o: library.cc library.h
    $(CC) -o library.o -c library.cc

create_random_file: create_random_file.cc library.o
    $(CC) -o $@ $< library.o

get_histogram: get_histogram.cc library.o
    $(CC) -o $@ $< library.o

5. Hints and facts about C

  • Generating random characters:
#include <stdlib.h>

...
char c = 'A' + (rand() % 26);
...
  • Measuring system time in milliseconds
#include <sys/timeb.h>
...
struct timeb t;
ftime(&t);
long now_in_ms = t.time * 1000 + t.millitm;
...
  • Perform block-based write to disk
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

...

long block_size = ...;
char *buf = ...

FILE *fp = fopen(filename, "w");
fwrite(buf, 1, block_size, fp);
fflush(fp);

...

fclose(fp);

Note:

  • fwrite(buf, itemsize, itemcount, fileptr) writes a total of itemsize * itemcount bytes from buf to fileptr.
  • fflush(fileptr) flushes the C library's buffer so that the data is handed to the operating system. Modern operating systems have their own buffer managers, so the actual disk write may still be deferred in the OS cache.
  • Perform block-based read from disk
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

long block_size = ...;
char *buf = ...
FILE *fp = fopen(filename, "r");

...

memset(buf, 0, block_size);
fread(buf, 1, block_size, fp);

...

fclose(fp);

6. Deliverables

  • A tarball in .tar.gz format named a1.tar.gz that includes all project files, i.e. C/C++ files, Makefile, scripts, plots, and report. Make sure that you can tar/untar your file on CDF successfully. If we cannot untar the file, your assignment will not be marked.
  • A maximum 4-page report in PDF format.