1. Disk access characteristics
In this assignment, we investigate the data access characteristics of secondary storage devices. In the end, you will see the need of block based data transfer between main memory and the hard drive.
Refer to the hints for some useful information on C.
2. Sequential write to file
2.1. Random file generation
Implement a function:
/** * populate a random array (which is already * allocated with enough memory to hold n bytes. */ void random_array(char *array, long bytes);
random_array()
populates the array with random characters from
A to Z inclusively.
Now, use this function to implement the command create_random_file
with
the following syntax. Your program should allocate a fixed amount of memory
char buffer[block_size]
, and repeatedly generate random content
into buffer
, and then write buffer
to disk. A reasonable block_size
is 1MB.
$ create_random_file <filename> <total bytes> <block size>
Your program will need to report the time it took to write the random bytes to the file using the specified block size.
2.2. Experiments
-
Write a script to experiment with different block sizes, and record the respective write data rate in bytes/second. You can use any scripting language (e.g. python, bash). Your scripts MUST run on CDF. Your script has to measure the rate for 10 different block sizes in the range of 100B to 3MB
-
Plot the observation of write data rate versus block size. Provide a simple explanation of the observation.
-
Discuss the existence of the optimal block size for write.
-
Try it on different machines and different storage medium (hard drive, solid state drive, USB storage). Compare the plots.pick at least 2 different medium
You have to prepare a very short report for the following three sections:
3. Sequential read from file
In this section, you are to experiment with the performance of sequential scan with different read block sizes.
3.1. Computing histogram using block read
You are to implement a function which scans through the file in blocks, and compute the distribution of different letters, i.e., count the occurrances of each letter: `A` to `Z`. Your function returns the status code./** * file_ptr : the file pointer, ready to be read from. * hist: an array to hold 26 long integers. hist[0] is the * number of 'A', and hist[1] is the number of 'B', etc. * block_size: the buffer size to be used. * milliseconds: time it took to complete the file scan * total_bytes_read: the amount data in bytes read * * returns: -1 if there is an error. */ int get_histogram( FILE *file_ptr, long hist[], int block_size, long *milliseconds, long *total_bytes_read);Here is a sample of how your function can be used:
long hist[26]; long milliseconds; long filelen; FILE *file_ptr = fopen("mybigfile.txt", "r"); /** * Compute the histogram using 2K buffers */ int ret = get_histogram( file_ptr, hist, 2 * 1024, &milliseconds, &filelen); assert(! ret < 0) printf("Computed the histogram in %d ms.\n", milliseconds); for(int i=0; i < 26; i++) { printf("%c : %d\n", 'A' + i, hist[i]); } printf("Data rate: %f Bps\n", (double)filelen/milliseconds * 1000);Create an executable with the arguments:
$ get_histogram <filename> <blocksize>Your program is to display the histogram, the block size used, the total number of bytes read from the file, and the time took in milliseconds. This is the sample printout of the program:
A 43948 B 48299 C 43424 D 23852 ... Z 48603 BLOCK SIZE 1048576 bytes TOTAL BYTES 104857600 bytes TIME 72 milliseconds
3.2. Experiment
-
Write a script to experiment with different block sizes, and record the read data rate in bytes per second. You can use any scripting language (e.g. python, bash). Your scripts MUST run on CDF machines. Your script has to measure the rate for 10 different block sizes in the range of 100B to 3MB
-
Plot the observed read data rate versus block size. Explain the curve.
-
Discuss the existence of an optimal block size for read. Compare this with the case for write.
-
Run it on different machines and different medium. Compare the plots. pick at least 2 medium
You have to prepare a short report for following three sections:
4. Organize your code
CC = g++ library.o: library.cc library.h $(CC) -o library.o -c library.cc create_random_file: create_random_file.cc library.o $(CC) -o \$@ $< library.o get_histogram: get_histogram.cc library.o $(CC) -o \$@ $< library.o
5. Hints and facts about C
- Generating random characters:
#include <stdlib.h> ... char c = 'A' + (rand() % 26); ...
- Measuring system time in the units of milliseconds
#include <sys/timeb.h> ... struct timeb t; ftime(&t); long now_in_ms = t.time * 1000 + t.millitm; ...
- Perform block based write to disk
#include <stdio.h> #include <stdlib.h> #include <string.h> ... long block_size = ...; char *buf = ... FILE *fp = fopen(filename, "w"); fwrite(buf, 1, block_size, fp); fflush(fp); ... fclose(fp);
Note:
- Line 11:
fwrite(buf, itemsize, itemcount, fileptr)
writes in totalitemsize * itemcount
bytes tofileptr
frombuf
. - Line 12:
fflush(fileptr)
forces a disk write. Modern operating system may have its own buffer manager, so your write may be deferred in the cache.
- Perform block based read from disk
#include <stdio.h> #include <stdlib.h> #include <string.h> long block_size = ...; char *buf = ... FILE *fp = fopen(filename, "r"); ... bzero(buf); fread(buf, 1, block_size, fp) ... fclose(fp);