Randomization formulas

There are several places in your program in which you need to make a random selection of some kind. Here is an explanation of each one.

"Discrete" selections are over a finite set, often integers in a certain range, or amongst a handful of possibilities, perhaps described non-numerically (e.g. randomly choose whether you want to go left or right). "Continuous" selections are over a portion of the real number line.

The random selections in assignment three are, in order in the handout:

  1. a uniformly-random selection of number of bytes in a packet, between 64 and 1518 inclusive
  2. a uniformly-random number of microseconds between 50 and 500000 (this could be a continuous selection, but we'll make it an integer number of microseconds)
  3. a uniformly-random selection of number of bytes between 10 less than and 10 more than the previous packet size but not going outside the range 64..1518
  4. an exponentially-randomly-distributed number of bytes with minimum size 64 and mean 100
  5. a uniformly-random number of bytes between 100 and 100000
  6. a uniformly-random number of microseconds between 1000 and 2000 (again, we'll make it an integer)
  7. an exponentially-distributed random time with a mean of 2000000 microseconds

Let us now go through the development of various randomization formulas which will produce all of the numbers above.

First of all, the rand() library function (declared in <stdlib.h>) produces not a real number selection in the half-open interval [0,1) as we like, but rather a random non-negative integer. Different versions of the C library (on different platforms) produce numbers in a different range, so there is a value RAND_MAX (defined in <stdlib.h>) which states the maximum value which rand() can return on that machine.

If we want a real number selection in the half-open interval [0,1), we can thus use the formula

	rand() / (RAND_MAX + 1.0)
It is important that the addition of RAND_MAX + 1 is performed in double-precision, because the number of bits in an int may well be greater than the number of bits in the mantissa of a single-precision floating-point number. However, "1.0" is of type double in C, not float, so RAND_MAX will be converted to type double to perform the addition.

It is also important that the division occur in double-precision floating-point, and especially that it is not integer division. But the mixed-mode arithmetic rules will assure this too for the above expression, because the denominator is of type double.

We can now produce the selection required by all of the uniformly-random selections in our initial list. If we wanted a continuous selection in [0,x) for any x, we could just multiply the above by x. But in fact all of the uniform selections above are discrete.

If we take the formula which produces a value in the half-open interval [0,1) and multiply by the number of choices, we get a number whose integer part has the appropriate number of possibilities and is uniformly-randomly selected.

So, for example, a random selection amongst 0, 1, 2, and 3 is yielded by the formula

	floor(4 * (rand() / (RAND_MAX + 1.0)))
Strangely, floor() (which is in the math library) returns type double in C, but it can safely be converted to an integer.

A considerably simpler formula also apparently yields this same selection:

	rand() % 4
However, this only works for a fairly good random number generator. If you think about the representation of the integer result of rand() in binary, this formula simply takes the least-significant (rightmost) two binary digits of that number. There are several PRNGs whose generated number streams are apparently of good enough quality but in which some of the bits, most commonly the less significant bits, are not of good random quality in isolation. rand() is traditionally such a PRNG and you shouldn't use it with the mod operator in this way unless you know that the PRNG in the particular version of the C library you are using is of superior quality.

Most of the uniformly-distributed selections above have a non-zero minimum value. This can simply be added to the above formula. For example, if we want to choose a random number of bytes between 64 and 1518 inclusive, this is 1518-64+1 different possible values, which is 1455; so the formula will be

	floor(1455 * (rand() / (RAND_MAX + 1.0))) + 64

Obviously, you should write a function which returns a uniformly-distributed discrete selection in a certain range, where the function parameters are the lower bound and upper bound, inclusive. This will take care of types #1, #2, #5, and #6 above.

It's also close to taking care of type #3. The trick in type #3 is the fact that the packet size cannot go below 64 nor above 1518. So for example, if the received packet is 1517 bytes, then there are only 12 possible values for the next packet size: 1507, 1508, 1509, 1510, 1511, 1512, 1513, 1514, 1515, 1516, 1517, and 1518. They should have equal probability. (There isn't much work left to do here, but we want you to do it.)

Finally, the exponential distribution. For random selections #4 and #7, we have a minimum (which is 0 in item #7) and a mean, with no set maximum. That is, we have a particular instance to hand, a particular host deciding how long to wait before issuing the next command; and we want to generate a waiting interval such that overall, the average time between events is two seconds. This is non-trivial because a shorter waiting time means more events; you can't just make a uniform random selection with a mean of two seconds.

Instead, we base it on the "exponential distribution" which looks something like e-t, where "e" is the base of the natural logarithm (2.718281828...).

Read the notes about a cumulative probability distribution function. If for a random variable X we have a cumulative probability distribution function F(x), then we can map a uniform random selection "u" in [0,1) to a selection of a value for X by finding the lowest x such that F(x) >= u (note that the cumulative PDF is necessarily non-decreasing). For a continuous random selection, this is the inverse of the function F.

The probability density function for the exponential distribution is 0 for t<0, and e-t for t>=0. This function f(t) is the probability density function for the time between consecutive events, if they occur at a particular average rate. (This exact function yields an average value of 1, not 2, but we'll deal with that afterwards.)

If f(t) is e-t for t>=0 and 0 for t<0, then F(x), the cumulative probability distribution function for this probability density function, is the integral from minus infinity to x of f, which is 1 - e-x.

If g(x) = y = 1 - e-x,
e-x = 1 - y
-x = ln(1-y)
x = -ln(1-y)

This is an exponential selection with a mean of 1. If we multiply the value by q, we get an exponential selection with a mean of q.

We'd better not attempt to calculate ln(0). But if y is a uniform random selection in [0,1), then 1-y is a uniform random selection in (0,1], so it all works out.

So, putting it all together, the following C function returns an exponentially-distributed random value with a mean of 2000000 (microseconds), suitable for use as the wait between the completion of the server reply and the initiation of a subsequent client request, for hosts of type 3 as described in the assignment handout:

double exp2Mrand()
{
    return -log(1 - rand() / (RAND_MAX + 1.0)) * 2000000.0;
}
log() is a function in the math library which returns the natural (base e) logarithm of its argument.

(Discussions of the exponential distribution often use a "rate" instead of this mean value. The "rate" is simply the reciprocal of the mean. A higher rate indicates more arrivals; they are faster; the rate of arrival is higher. The formula in the readings, for example, involves dividing by the rate; so for a mean of 2000000, we would use a rate of 0.0000005, and dividing by 0.0000005 is the same as multiplying by 2000000.)

Since the range is two million, you can take the floor() of this, or round() of it, and it won't matter much; either is acceptable for this assignment. You might want to make your exp2Mrand() do this, and return type long instead of double.

But item #4 is slightly more complex for two reasons. First of all, it has a minimum of 64, but that can be dealt with somewhat similarly to the minima in the discrete uniform cases earlier above. But most of all, if you just take the floor of the -log(...) above, you will not end up with a mean of 100; it will be slightly less, because you have subtracted an average of 0.5 from all the values. You should come up with a formula which yields an integer but still has a mean of 100, a minimum of 64, and an appropriately-weighted chance of returning 64. Again, this is for you to figure out how to deal with.


[a3 q&a] [main course page]