Assignment two questions and answers

Here are some questions and answers about assignment two, and other notes. Suggestions for additions to this list are welcome (e.g. via e-mail).

Most of all, remember to "Keep It Simple". A more complex program is more likely to have bugs; it takes you longer to write; it is harder to maintain; it is unlikely to be more usable. Most cutesy features are not helpful, and are measurably hurtful.

In pragmatic assignment-writing terms, cutesy features don't get you extra marks, but the probabilistic expectation is that they will lose you marks on average, because they will introduce bugs which affect the working of the non-cutesy parts of the program.

That is to say: Wield your cleverness cleverly. Don't waste your cleverness in doing silly things.

Two pithy quotations:

"The superior pilot uses his superior judgement to avoid situations in which he has to demonstrate his superior skill." (traditional pilot saying)
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." — Brian Kernighan

Some resources:

lstat.c and other such files in lab six
The command-line argument parsing for whatyear.c is supplied for you in /u/csc209h/summer/pub/a2/whatyear-starter.c
There are videos about structs, and a full description at ../notes/struct.html
man pages
lecture notes about argc and argv
lecture notes about files in C
simple cat program which takes zero or more files as command-line arguments in the usual way
lecture notes about strings in C
how to call getopt(), and optionally detailed notes about getopt()
The unix filesystem video may be helpful for question 3
Please try out sample solutions (compiled programs only, so that you can't see my sample solution source code before the due date!), in /u/csc209h/summer/pub/a2 .

More general advice:

Don't store text for any longer than or in any more of a complex data structure than you need to. Process it as you go.

"crypt" will work on the text character-by-character; it does not need to store the whole file, or even a whole line. It doesn't need to assemble the output line into a char array either; just output it as you go!

And similarly, findempty should process one file or subdirectory at a time, rather than storing the results of readdir() calls.

Unless you are avoiding having a pathname string length limit, findempty does not need to do any dynamic memory allocation, except for what you get for free with recursion.

Keep It Simple. My solutions have the following line counts:

	90 crypt.c
	90 whatyear.c
	72 findempty.c

including all #includes, use of getopt(), etc. (And the above whatyear.c includes the 49 lines of starter code, too.) (I might add a few further comments before posting the solutions, but not too many.)

Don't type "vigenere cipher in C" into Google.

First of all, this is a really bad way to write computer programs. To write a working computer program, you need to understand the problem; understand your tools; and figure out how to bring the tools to bear to solve the problem. Web searches of this type accomplish none of these. You will not end up with a working program.

Secondly, this is a really bad way to take a course. The objective of an assignment is not to "get the right answer". The objective is to learn how to write a particular program and to complete this task successfully, learning from the experience.

Thirdly, although there are a lot of hits for such a web search, there seem to be only two pieces of code out there — everyone else is plagiarizing. (Don't follow their lead in this regard!) Also, one of them is completely wrong. Furthermore, we will be comparing them to your code as part of our searching for academic offences. If you copy them and try to disguise them, your disguise will be inadequate — we've been doing this much longer than you have.

But most of all, this is a bad way to write computer programs. You need to understand everything in your .c file if you want to get it to work. Copying in stuff you don't understand is asking for trouble.

In general, don't check whether operations will succeed; just try to do them and get an appropriate error if applicable. For example, if you're about to fopen() a file, don't do a stat() and try to determine whether the file exists and/or is readable. Just do the fopen() and check for error. This results in a simpler program, and also one which functions more correctly in the inevitable case that you have omitted some check, so that you think the operation is going to succeed but it doesn't. And there can always be unexpected i/o errors, etc.

The assignment handout does not necessarily specify all of the details of the required behaviour of your programs in all cases. I've tried to specify most things, but generally speaking, your programs are required to behave like standard unix tools. To answer some questions I've provided compiled sample implementations in /u/csc209h/summer/pub/a2 .

For example, the "usage" messages have a very specific format, which you must adhere to. It is similar to the SYNOPSIS section of the man pages, with their meaning for square brackets (indicating that something is optional) and ellipses (indicating "one or more of"). You can take the usage messages from the behaviour of the example compiled programs in /u/csc209h/summer/pub/a2 if you like; and fairly little variation is acceptable, although the token immediately following the "usage:" string can be either argv[0] or the base program name. I wrote about usage messages at ../notes/tiny/usage.html

Check for possible error return from all system calls, and from fopen(). For any library call or kernel call which can return an error indication, you have to check it and do something appropriate, even if it's just printing an error message and exiting.

Error messages must be to stderr, not stdout. And pay attention to your process exit status.

And be sure you understand perror(). Where perror() is applicable, it is obligatory, rather than formulating your own error message. Please look at what perror() does in the example cat.c — perror() produces a better error message than you can. (And it does its output to stderr, as we would want.)

However, perror() is only suitable for reporting the error status from certain library and kernel calls. It can't be used for general error messages, because it always appends the system's description of the current errno value to the string you give it.
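To make the distinction concrete, here is a minimal sketch. Both open_input() and check_positive() are hypothetical helpers invented for illustration (they are not part of the assignment or of cat.c), and the wording of the second message is likewise just an assumption.

```c
#include <stdio.h>

/*
 * Hypothetical helper: try the fopen() directly rather than calling
 * stat() first.  On failure, perror() writes the file name and the
 * system's description of the error to stderr.
 */
FILE *open_input(const char *path)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        perror(path);
    return fp;
}

/*
 * Hypothetical helper: a problem which does not come from a failed
 * library or kernel call has no meaningful errno, so it gets an
 * ordinary message on stderr instead of perror().
 * Returns 0 if value is acceptable, 1 otherwise.
 */
int check_positive(const char *what, int value)
{
    if (value <= 0) {
        fprintf(stderr, "%s must be positive\n", what);
        return 1;
    }
    return 0;
}
```

In both cases the caller still has to act on the result, typically by exiting with a non-zero status.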

I am happy to interpret compiler error messages (for CSC 209 students). Sometimes the compiler will emit error messages which you might find cryptic. I won't fix your assignment code for you, but I will tell you more clearly what a particular error message means. I will, sometimes, fix non-assignment-related code (although more frequently I'll give you hints instead).

Q: When I compile my program (any one of the three) I get the following warning message: [...]
Is this ok?

A: No. Your program should compile with "gcc -Wall" with no warning or error messages. Almost all of the warning or error messages which gcc -Wall can output represent potentially-serious problems, and you need to fix them. I am willing to decode error messages by e-mail (although not generally to fix your bugs, obviously).

Standard indentation is required in your C programs.
(And you should assume that your reader's window might not be any more than 80 characters wide.)

Your program must not exceed array bounds no matter what the user input (or command-line arguments).

Many cases of programs I see at this point in this course which contain lurking bugs of this nature are actually copying data entirely unnecessarily. Don't copy data when the original is just as good as the copy. For example, strings in the argv array can be used from that array directly, without copying the string data.

Q: Can I put some functions in a separate .c file and submit that too?

A: No. Your programs for this assignment are small enough that it isn't worth it to separate them into multiple files.

Q: Can I submit a .h file so that I can declare some functions and/or variables?

A: No, just declare them at the top of your .c file (or wherever is appropriate). The purpose of .h files is to coordinate declarations across multiple files. Each of your files should be self-contained for this assignment.

Q: How do you print to the standard error?

A: Use fprintf(stderr, "format" ... ), or any other stdio function which accepts a value of type FILE*

Also, perror() prints its message to stderr.

Q: But when I do fprintf(stderr, "this is an error message\n"), I still see it on the screen.

A: Both stdout and stderr are initially connected to your terminal window, but they can be redirected independently.

If your program says fprintf(stderr, "this is an error message\n"), then if you run "./a.out >file", you'll still see "this is an error message" on the screen and it won't go into the file. This is the purpose of using stderr, as previously discussed.

Q: If one of the files or directories specified on the command line cannot be opened, should we exit immediately or do we have to continue on through the rest of the arguments like cat does?

A: You have to call perror(), and you have to exit with a non-zero exit status eventually. So the easiest thing is just to exit right away. In most cases it's ok (desirable, even) to process the remaining files which do exist, correctly; but it's not required for this assignment. You'll find that some standard unix tools proceed after error in this way and some don't.
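For comparison, here is a sketch of the "carry on after an error" variant. The function copy_files() is a hypothetical name, and the copying loop is only a stand-in for whatever your program does with each file. Exiting at the first error is simpler and equally acceptable.

```c
#include <stdio.h>

/*
 * Hypothetical sketch: copy each named file to stdout.  A file which
 * cannot be opened is reported with perror() and skipped, and the
 * failure is remembered.  Returns the exit status the program should
 * eventually use: 0 if everything worked, 1 otherwise.
 */
int copy_files(int n, char **names)
{
    int status = 0;
    int i, c;

    for (i = 0; i < n; i++) {
        FILE *fp = fopen(names[i], "r");
        if (fp == NULL) {
            perror(names[i]);
            status = 1;
            continue;   /* or "return 1;" to stop at the first error */
        }
        while ((c = getc(fp)) != EOF)
            putchar(c);
        fclose(fp);
    }
    return status;
}
```

A main() using this would end with something like "return copy_files(argc - 1, argv + 1);", so that the remembered failure becomes the process exit status.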

The return type of getchar() and getc() is int, not char, and you can't store it in a char variable. With 8-bit chars, there are 257 possible return values: 0 to 255 indicating a byte of input (that's 256 possible values there), or −1 to indicate eof or error.

A value with 257 possible values cannot be stored in an 8-bit char. If you attempt to do so, e.g. if you have

	char c;
	while ((c = getc(fp)) != EOF) {

, then you won't be able to tell the input of byte number 255 apart from the EOF condition. (Either the comparison will fail in both cases or it will succeed in both cases, depending upon whether or not char values are deemed to be "signed" or "unsigned", both of which are legal for a C compiler.)

Once you've found that the value returned from getc() or getchar() is not equal to EOF, then it's safe to store in a char variable.
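A corrected version of that loop might look like the following sketch. The function name copy_stream() is made up for illustration; the point is only the type of c.

```c
#include <stdio.h>

/*
 * Copy everything from in to out, one byte at a time, and return the
 * number of bytes copied.  A byte with value 255 is copied like any
 * other, because c is an int and so can hold 0..255 as well as EOF.
 */
long copy_stream(FILE *in, FILE *out)
{
    int c;      /* int, not char */
    long n = 0;

    while ((c = getc(in)) != EOF) {
        putc(c, out);
        n++;
    }
    return n;
}
```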

Q: What is the difference between using fopen(), getc() or fgets() or fscanf(), then fclose(); as opposed to using open(), read(), and close()? Which should we use?

A: Normally you should use the 'f' functions. (By which term I mean to include getc() — i.e. you should feel free to use getc().)

The 'f' functions (fopen(), getc()/etc, fclose()) are part of the standard i/o library, which was built on top of the unix kernel calls (open(), read(), close()) for two reasons:

portability: The low-level file primitives work(ed) differently on different operating systems, but the stdio functions were designed to be implementable on top of any of them.
buffering: When you do getc(), it uses read() to read a bunch of bytes, not just one, and then getc() gives you one at a time, until you exhaust the buffer and then it does another read(). Even on modern computers this makes a significant speed difference (for sufficiently-large files).

So even if you have a unix-specific program, you should use fopen() and friends for basic file-processing tasks where suitable. (Future-looking note: On the other hand, you should use open() for doing i/o redirection, as we'll talk about when we talk about unix processes in a few weeks, because you aren't going to read any data from the file, you're just about to dup and exec and stuff, there's no point in having the extra stdio stuff allocated. We'll see this later.)

(You can get the unix file descriptor underlying a FILE* with the fileno() function (that is, fileno(fp) is the file descriptor number). You can go the other way by using fdopen(), which creates all the FILE* stuff around an already-opened file identified by file descriptor number ("fd"). These two functions are rarely necessary and won't be of use to us in this course.)

C's "sizeof" operator does not give you the size of an array, in general. If you think there's no way to write a particular bit of code without using sizeof, then sizeof probably won't help you there, either. In particular,

void f(int *a)
{
    int i;
    for (i = 0; i < sizeof a; i++)   /* WRONG */
        ...
}

is completely wrong. It will not iterate the correct number of times. The variable 'a' will have size 8 on our linux machines, because that's how many bytes are used by a pointer. If you want to know the number of elements in the array which 'a' points to, you need to pass that value in as a second parameter, of type int.
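A minimal sketch of the fix, with the element count passed in alongside the pointer (the function name sum() is just an example):

```c
/* Add up the n ints which a points to.  The count comes from the caller. */
int sum(int *a, int n)
{
    int i, total = 0;

    for (i = 0; i < n; i++)
        total += a[i];
    return total;
}
```

The caller, which declared the array and so knows how big it is, supplies the count: for "int v[3];", the call is sum(v, 3).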

Q: Various segfault or bus error problems ("Segmentation exception" or "Bus error").

When dereferencing a pointer value, or giving it to another component to dereference, make sure it points somewhere. If you say "char *s;", then you can't say "strcpy(s, something)" immediately after. "s" is uninitialized and you can't assume it points to the zeroth byte of an array where you can store your data. You have to assign it a value pointing to such a thing if you want to use it. Better yet, often you can just declare an array in the first place, rather than a pointer variable.
Be sure you are not exceeding array bounds, including in string manipulation. If "s" is a string, then "char t[5000]; strcpy(t, s);" is an error, because you don't know that s is shorter than 5000 characters. Check lengths with strlen() to be sure that the string will fit in the target array, and don't forget to leave room for the terminating zero byte.
Check error returns from system calls. Even in initial development!
When you do observe (e.g. with an error return from a system call) that something shouldn't be done, make sure you don't do it! I've seen a surprising amount of beginner C code in my life which checks for error returns from system calls correctly and prints a nice error message but then comes out of the 'if' and uses the invalid pointer value anyway, etc. Apparently this is an easy mistake to make (although it's not clear to me why).
You can localize the segfault using gdb. "gdb" is similar to other debuggers you may have used with other programming languages (e.g. in CSC 207). I've written an intro document about using gdb.
Think methodically; form and reject hypotheses. Understand everything you write in your program. The fact that your program does not work is not sufficient motivation to make a particular change. You have to understand the change you are making and have a good reason to do it. Debugging involves understanding what was wrong with your program, not just making the bad behaviour seem to go away.

Various getopt() questions:

See "man getopt". But typing "man getopt" gives you a tool for use in shell programming. So say "man 3 getopt". (And to be clear, you should be using getopt(), not getopt_long(), for assignment two.)

See the supplied example call of getopt() in getopt.c. Please understand that program fully before copying any of it!

Here are some notes about getopt, of which you might want to read the "interface" section, after reading getopt.c above.

You are required to use getopt() for crypt.c rather than parsing the command-line options yourself. All sorts of bizarre syntaxes are possible and will be dealt with automatically by getopt(). In the old days, everyone writing a unix tool parsed the options themselves, and the result was a lot of inconsistency as to whether or not you could do certain things (even including fundamentals such as combining options into one argument, e.g. writing "ls -qa" instead of "ls -q -a"). These days, everyone calls getopt(), and the users of your program may use a feature of standard option parsing which you didn't even know exists. This is good.

Be careful to use getopt() properly. Do not make assumptions as to the format of the command line. The standard unix command-line option format is actually extremely flexible in some ways. For example, these are all valid ways to execute the example getopt.c with '-c' value 17 and with the '-x' option, and a further command-line argument "file":

        ./getopt -c17 -x file
        ./getopt -x -c17 file
        ./getopt -x -c 17 file
        ./getopt -x -c 17 -- file
        ./getopt -x -c17 -- file
        ./getopt -c1 -x -c2 -x -c3 -x -c4 -x -c5 -x -c17 file

And furthermore, none of these is a special case. If you call getopt() correctly, as discussed in the man page and as shown in the supplied example getopt.c, all of these cases and more are handled automatically, without trying, with no special cases. The getopt() library routine contains all of the relevant complexity.

Q: "findempty" takes no command-line options. So if the user does "./findempty -q", is that an error?

A: No, it is a request to search a directory named "-q". That is, this is not a special case. Keep It Simple.

Q: In crypt, do we need to deal with the special case of a command-line argument of "-"?

A: Well, the instructions didn't say to. But you might as well do it, because it's easy; just follow the example cat.c.

Q: How about in whatyear.c?

A: No, because that doesn't make sense. Nor for findempty. Just for crypt.

Q: Do we need to include comments in our code?

A: We do expect C programs to be well-organized and readable, much more than with the shell scripts in assignment one.

"All programs are poems; it's just that not all programmers are poets."

Make your program nice. Keep it simple. Someone who knows C well should be able to read your program without much confusion. Comments can help this process.

On the other hand, do not teach your reader C — assume that your target audience knows C, and knows the problem domain.

I think that the ideal program would be so clearly readable that it would contain no comments at all except for an introductory comment at the top (the "prologue comment"). (I also think that this ideal is often or usually not achievable, and even more often not in fact achieved.)

I've written a lot more about comments in ../comments.html.

crypt always outputs to stdout, whether its input is from stdin or from one or more files whose names are specified on the command line.

Don't focus on input from the terminal (in general). Redirect your input from a file or a pipeline to avoid a host of red herrings, especially with respect to eof-terminated input streams.

Don't output anything other than the transformed file contents. If there are multiple files in crypt, just process them in order with no additional output.

Assuming maximum path lengths:

Q: Can we assume a maximum path name length in findempty.c?

A: Well, sort of. You can set a maximum (make it at least, say, 2000 chars) so that you can declare your array, but if the path name is too long, you must print an appropriate message to stderr and exit; nothing can be permitted to make you exceed the array bounds.
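One way to stay within a fixed-size array is to let snprintf() do the length checking, as in this sketch. The helper name join_path() is hypothetical, and this is only one possible approach; checking with strlen() before copying works just as well.

```c
#include <stdio.h>

/*
 * Build "dir/name" in buf, which has room for size bytes including
 * the terminating zero byte.  Returns 0 on success, or -1 if the
 * result would not fit, in which case buf must not be used.
 */
int join_path(char *buf, size_t size, const char *dir, const char *name)
{
    int n = snprintf(buf, size, "%s/%s", dir, name);

    if (n < 0 || (size_t)n >= size)
        return -1;
    return 0;
}
```

When join_path() returns -1, the caller prints a message to stderr and exits with a non-zero status, rather than carrying on with a truncated path name.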

Q: What about the array holding the input line in crypt.c?

A: Don't have an array holding the line at all! Instead, loop with a simple getc(), storing just one character at a time.

Note that your program also exceeds array bounds (and thus is buggy) if it asks a library function to exceed array bounds, e.g. if you call strcpy(x,y) without basically having in mind a mathematical proof that the length of the string y is such that the data will fit into the array whose zeroth character is pointed to by x.

Q: Does crypt have to store the entire input file so as to be able to perform all of the output only after the user presses ^D?

A: No. The timing of the input and output is not part of the specification. So you should do whatever is easiest in that regard, under the principle of "Keep It Simple".

In general, process data as you go, don't store it.

Q: How does crypt detect whether its standard input is a file or a terminal?

A: It doesn't, and it mustn't. The behaviour must not differ. Don't be "smart". Keep it simple. Process all data until eof, whatever the source of the data.

An example of reading a directory with opendir() / readdir() / closedir() can be found in readdir.c in lab six.
The C "->" syntax is discussed in https://www.teach.cs.toronto.edu/~ajr/209/notes/struct.html. Basically it means the same as Java's "." when used to select members of an object; and for this assignment, you only need to use it exactly as shown in supplied code examples. (You'll get more familiar with these syntaxes later in the course.)

(Actually, x->y is simply defined as (*x).y.)
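A tiny sketch showing the two spellings side by side, using a made-up struct:

```c
struct point {      /* hypothetical struct, for illustration only */
    int x;
    int y;
};

/* Store through the pointer p, once with each spelling. */
void set_point(struct point *p, int x, int y)
{
    p->x = x;       /* arrow form */
    (*p).y = y;     /* the same operation, written out in full */
}
```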

Q: What is a DIR* ? (the return value from opendir())

A: It is a very similar concept to a FILE* — it is the information about an open directory which you need to pass to readdir() for it to know which input stream to read from. In fact, an implementation of opendir() and friends which I've read the source code to just defines DIR as FILE in dirent.h. But some of them don't, so you should declare it correctly.

Q: What's the difference between stat() and lstat()?

A: For most directory-tree-traversing programs, including findempty, it's important to use lstat(), as follows.

For the most part, if you attempt to access a symbolic link, the kernel follows this symbolic link automatically, giving you instead the file that the symlink points to. If this weren't the case, then symlinks wouldn't mean what they do mean. A symlink is a stand-in for the pointed-to file.

But you can't have the kernel always following symlinks, only almost-always. For example, an ls -R, or find, would get very confused by symlinks if it called stat() rather than lstat(). In particular, if a symlink points to a parent directory, then to opendir that symlink and continue traversing from there will result in infinite recursion.

So when symlinks were introduced, a dozen or so programs needed to be modified to be able to continue to work in their presence. These days, many more programs need to be aware of symlinks. Anything which traverses a directory tree needs to treat symlinks-which-point-to-a-directory differently from directories. Programs such as "ls" need to collect information on the symlink, rather than the pointed-to file.

The way to do this is to call the special call "lstat()", which is like stat() so long as its parameter is not a symlink. If its parameter is a symlink, it does not follow the symlink, but rather, reports information about the symlink itself.

Thus for example, "ls -l" calls lstat(), not stat(). There is an option '-L' to make ls follow the symlinks, but otherwise it doesn't.

For more examples: "test -f" calls stat(), but "test -L" (check whether the file is a symlink) needs to call lstat().

crypt has no reason to call stat() or lstat(), but if it did, it would call stat, not lstat, because we do want it to follow symlinks, in the normal way.
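Here is a minimal sketch of how a traversal program might tell a symlink from a real directory using lstat(). The function classify() and its return codes are invented for this example; lstat.c in lab six is the authoritative example for the course.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: classify a path without following a symlink.
 * Returns 'l' for a symlink, 'd' for a directory, 'o' for anything
 * else, or -1 if lstat() fails (after reporting it with perror()).
 */
int classify(const char *path)
{
    struct stat st;

    if (lstat(path, &st) != 0) {
        perror(path);
        return -1;
    }
    if (S_ISLNK(st.st_mode))
        return 'l';
    if (S_ISDIR(st.st_mode))
        return 'd';
    return 'o';
}
```

With stat() in place of lstat(), the 'l' case could never occur: a symlink pointing to a directory would be reported as 'd', which is exactly what leads a recursive traversal astray.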

Q: Why do you have to skip "." and ".." in the findempty recursion?

A: "." is a reference to the directory which "." is in. For example, /u/csc209h/summer/pub/. is the same as /u/csc209h/summer/pub, and /u/csc209h/summer/pub/a2/. is the same as /u/csc209h/summer/pub/a2. Somewhat similarly, ".." is the parent directory, so /u/csc209h/summer/pub/a2/.. is the same as /u/csc209h/summer/pub. This is explained in some detail in the unix filesystem video.

To traverse the directory /u/csc209h/summer/pub (for example), you will recursively traverse all subdirectories, such as /u/csc209h/summer/pub/a2. However, if you recursively traverse /u/csc209h/summer/pub/., that is itself a traversal of /u/csc209h/summer/pub and thus you have an infinite loop (infinite recursion). Similarly, if you recursively traverse /u/csc209h/summer/pub/.., that is the same as /u/csc209h/summer, and you will eventually get back down to /u/csc209h/summer/pub, and also have an infinite loop.

So you have to skip "." and ".." when looking at the contents of a subdirectory. (However, these are still valid directory names for the command-line; make sure you put your 'if' statement in the right place.)
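The test itself is small. A sketch, using a hypothetical function name:

```c
#include <string.h>

/* Returns non-zero if name is exactly "." or "..", and zero otherwise. */
int is_dot_or_dotdot(const char *name)
{
    return strcmp(name, ".") == 0 || strcmp(name, "..") == 0;
}
```

This would be applied to each name returned by readdir(), not to the command-line arguments. Note that names which merely begin with a dot, such as ".profile", are not skipped.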

Q: What should the exit statuses of all of the programs be? What's 0, 1, and 2?

A: Normally, programs exit with exit status zero for success and one for failure. All three assignment two programs are like this: normally the exit status will be zero, but if there is a usage error or if an fopen() fails, the exit status should be one.

Note the two options for how you interpret the command-line key in crypt:
1) you insist that the key be all-lower-case (e.g. "abcDEF" is an error)
2) you interpret the key in a case-insensitive way (e.g. "abcDEF" is the same as "abcdef").

You can do either one of these. Your choice does not affect the usage message. Automated testing will be with either all-lower-case letters, or with some non-letters in there to test your program's fatal error message.

Q: What does it mean that "a chdir() in processing the first directory may invalidate the name of the specified second directory"?

A: First of all, this is only an explanation for why the chdir() strategy is not appropriate for findempty. If you're not considering using chdir(), you don't need to be talked out of it!

But if you're interested:

Consider that when doing directory traversal, if you have a directory named "foo" and a file in it named "bar", rather than constructing the pathname string "foo/bar", you could just do chdir("foo"), and use the name "bar". After processing the directory foo, you do chdir("..").

This is slightly easier than the string operations, but it's often not worth it. You need to put together the path name for output anyway, so why not put it together to pass to opendir() first?

But more to the point, if the command-line is something like "findempty /a/b/c d/e/f", after you chdir("/a/b/c") and to subdirectories, no amount of chdir("..") is going to get you back to the directory you were originally cd'd to when the program started. So the pathname "d/e/f" is not going to work. So you can't use this chdir() strategy for findempty.

Q: Why am I not allowed to use ftw() or fts() to write findempty?

A: Because they contain the basic directory traversal code which is the point of this assignment. Some people can call ftw(), but someone else has to write ftw(). This assignment is about writing the directory traversal code.

Q: Should I use realpath() or getcwd() to find the path name for a directory/file?

A: No. Reply in the user's own terms. If they specify a pathname such as "foo/bar", then you will output file path names such as "foo/bar/baz", which are valid if foo/bar is valid. Don't be "clever" about this, just do it the obvious and simple way.