Newline conventions and internet communications

A text file is a sequence of zero or more "lines". A line of a text file is a sequence of zero or more non-newline characters followed by a newline.
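As a rough illustration of this definition, here is a minimal sketch that counts the lines in a buffer and rejects data whose last line is not newline-terminated. It assumes the unix convention (a single LF byte), and the name count_lines is just something I made up for this example.

```c
#include <stddef.h>

/* Hypothetical helper: count the lines in buf[0..n-1], assuming the unix
 * newline convention.  Returns -1 if the data does not end with a newline,
 * since by the definition above it is then not a sequence of whole lines. */
long count_lines(const char *buf, size_t n)
{
    long lines = 0;
    for (size_t i = 0; i < n; i++)
        if (buf[i] == '\n')
            lines++;            /* every newline terminates exactly one line */
    if (n > 0 && buf[n - 1] != '\n')
        return -1;              /* trailing characters with no newline */
    return lines;
}
```

Note that an empty buffer is a valid text file of zero lines, and a buffer containing only a newline is one (empty) line.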

What's a newline?

Different operating systems have different newline "conventions". The ASCII character standard says that if you are going to use a single byte to separate lines, you should use byte number 10, which we also call "control-J", or "line feed" or "LF". Using this single character is the newline convention for unix, so in unix we often simply call this byte value the "newline character", and we get it in C in unix by typing "\n".

MS-Windows uses a two-byte sequence to separate lines in a text file. The original ASCII standard actually says that this is preferred, and it was common in the 1970s and earlier. (I think that this preference for a two-byte sequence is unfortunate, and that it's helpful that unix uses a single byte.) These two bytes are what we could call "control-M" and "control-J", in that order. "Control-M" is also known as "carriage return" or "CR". Together, this two-byte sequence is called "CRLF".

Some other operating systems have other newline conventions.

These differing newline conventions present an interoperability problem. You may at some time have transferred a file from MS-Windows to unix and found that it ended up with a bunch of ^Ms in it on unix. This happens whenever you transfer a text file from MS-Windows to unix without converting the MS-Windows newline sequence to the unix newline sequence: unix treats the LF of each CRLF as the newline, so the CR is left over as an extra character at the end of each line, and many unix programs display it as ^M. Such a conversion is often a feature of file transfer programs, but it must be optional because it would make a mess of non-text files.

How do we make the internet interoperable?

If dissimilar computers are to be able to communicate over the internet, the data transmitted over the internet must not be machine-type-specific. That is, if we intend to transmit the same data from a unix machine and from a CRLF-using machine, those two transmissions must be the same sequence of bytes.

In the case of transmitting data which we consider to be integers, we need to standardize on the byte order. The standard network byte order is big-endian; a little-endian machine must swap bytes in integers when copying them to and from network transmission buffers. This is discussed in the Haviland et al. textbook, and in the videos.
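Here is a minimal sketch of that byte swapping for a 32-bit integer, using the POSIX functions htonl() and ntohl() from <arpa/inet.h>. The helper names put_u32 and get_u32 are hypothetical, chosen just for this example.

```c
#include <arpa/inet.h>   /* htonl, ntohl (POSIX) */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy a 32-bit integer into a transmission buffer
 * in network (big-endian) byte order. */
void put_u32(unsigned char *buf, uint32_t value)
{
    uint32_t net = htonl(value);   /* swaps bytes on a little-endian machine,
                                      does nothing on a big-endian one */
    memcpy(buf, &net, sizeof net);
}

/* Hypothetical helper: read a 32-bit integer back out of a received
 * buffer, converting from network byte order to the host's byte order. */
uint32_t get_u32(const unsigned char *buf)
{
    uint32_t net;
    memcpy(&net, buf, sizeof net);
    return ntohl(net);
}
```

After put_u32(buf, 0x01020304), the buffer holds the bytes 1, 2, 3, 4 in that order on any machine, which is the point: the transmitted bytes do not depend on the sender's byte order.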

In the case of transmitting text, the ASCII standard gives us standard byte values for just about everything except newlines. So we need to adopt a newline standard for network text transmission. (This is not discussed in that textbook; it seems to me that it's an omission.)

The way in which you encode the newline concept in bytes is called a newline "convention". Just as we have a network byte order, we have a network newline convention.

The network newline convention is CRLF. That is, a newline is represented by the two bytes (in order) which we could call CR and LF, or control-M and control-J, or 13 and 10, or \015 and \012.
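To make the byte values concrete, here is a small sketch of how a receiver might look for the end of a line in data it has read from the network. The function name find_network_newline and its return convention are assumptions made for this example, not anything required by a standard.

```c
/* Hypothetical helper: search buf[0..n-1] for the first network newline,
 * i.e. byte 13 (CR) immediately followed by byte 10 (LF).
 * Returns the index just past the LF, or -1 if no complete CRLF is there. */
int find_network_newline(const char *buf, int n)
{
    for (int i = 0; i + 1 < n; i++)
        if (buf[i] == 13 && buf[i + 1] == 10)
            return i + 2;
    return -1;
}
```

A lone LF, or a CR at the very end of the buffer whose LF has not arrived yet, does not count as a network newline here, so the caller would keep reading in that case.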

How do we write C code to specify a CRLF?

In unix, \n means LF. (In MS-Windows, \n should expand to the two characters ^M and ^J, by the time it gets written to a file on disk, although the C standard requires it to seem to be a single character to your C program.)

Also, in unix you can use \r for CR. So we can write the network newline convention in a string in unix as \r\n. This is, then, a unix-specific encoding of this purposefully-non-unix-specific network newline sequence.

In general in C, we could write "\015\012". This is then not unix-specific. However, your assignment four is probably already unix-specific in more substantial ways than this, and it's fine to write "\r\n" in your assignment four, in my opinion. I think that "\r\n" is the usual way this is written in unix network communication programs. On the other hand, it's certainly at least as good to write "\015\012" instead (if not better).
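For example, a line of text could be prepared for transmission along the following lines. This is only a sketch: format_net_line is a hypothetical name, and it uses the "\r\n" spelling, so it assumes a unix-like system where \r is byte 13 and \n is byte 10.

```c
#include <stdio.h>

/* Hypothetical helper: write text followed by the network newline into
 * out, which has room for outsize bytes including the terminating '\0'.
 * Returns the number of bytes to transmit, or -1 if it does not fit. */
int format_net_line(char *out, size_t outsize, const char *text)
{
    /* "\r\n" could equally be written "\015\012" */
    int n = snprintf(out, outsize, "%s\r\n", text);
    if (n < 0 || (size_t)n >= outsize)
        return -1;
    return n;
}
```

Calling format_net_line(buf, sizeof buf, "hello") yields seven bytes: the five letters, then byte 13, then byte 10.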

Following the network newline convention means that you convert to CRLF upon transmission (e.g. in unix you put a ^M before every ^J) and that you convert from CRLF upon receipt (e.g. in unix you either ignore all ^Ms, or treat them the same as ^Js if blank lines don't matter, or do the more thorough operation of converting all CRLF pairs to just LF).
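The two conversions might be sketched as follows for a unix program. The function names are hypothetical, and the receiving side implements the "more thorough" option of turning each CRLF pair into a single LF while leaving any stray CR alone.

```c
#include <stddef.h>

/* Hypothetical helper for transmission: copy in[0..n-1] to out, putting a
 * CR before every LF.  out must have room for 2*n bytes in the worst case.
 * Returns the number of bytes written to out. */
size_t lf_to_crlf(const char *in, size_t n, char *out)
{
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] == '\n')
            out[j++] = '\r';
        out[j++] = in[i];
    }
    return j;
}

/* Hypothetical helper for receipt: convert every CRLF pair in buf[0..n-1]
 * to a single LF, in place.  Returns the new length. */
size_t crlf_to_lf(char *buf, size_t n)
{
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        if (buf[i] == '\r' && i + 1 < n && buf[i + 1] == '\n')
            continue;   /* drop this CR; the LF is copied next time around */
        buf[j++] = buf[i];
    }
    return j;
}
```

One limitation of this sketch: if a CRLF pair happens to be split across two separate read() calls, the CR at the end of the first buffer is kept, so a real program would need to carry that state over between reads.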