Errata for entropy

Before diving into the second part of the previous post, I thought it might be useful to clear up a common misconception about entropy first.

If you have ever taken a course in information theory, you are familiar with the concept of entropy. When I took the course, entropy was introduced to me as the amount of unknown information. Of course, to make sense of this sentence, one must first define what information is. I remember the very first lecture, when our instructor tried to explain what is information and what is not. Let me quote him first.
"Imagine that you lose sleep, and got up at 3:00 am in the morning. You are surfing in the web, and you came across with the news about an earthquake in California, happened just an hour ago. Then you went back to bed, and woke up at 8:00 am. When you were drinking your coffee and reading the newspaper, you saw the news about the earthquake in California again. Now, which news - the 3:00 am or the 8:00 am - contains information?"
The answer, as you have already guessed, is the 3:00 am news. Why? Because when you read the same news in the newspaper, it doesn't tell you anything you don't already know. It contains no information for you. If the guy sitting at the next table (who is also reading the same newspaper) had a sound sleep, then the 8:00 am news definitely contains information for him.

At this point, I guess the rest of the lecture depends on which department you are taking the course in. As electronics engineering students, we quickly jumped to the calculation of entropy, since our inspiration was Claude Shannon's 1948 paper (an astonishing paper, by the way). Our aim was to determine the amount of uncertainty in a given binary string: the lower the uncertainty, the fewer bits needed to encode it. In the end, the amount of information was equal to the amount of uncertainty.
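As a reminder, the quantity Shannon defined in that paper is \begin{align} H = -\sum_{i} p_{i} \log_{2} p_{i} \,\,\, \text{bits per symbol}, \nonumber \end{align} where \(p_{i}\) is the probability of seeing symbol \(i\); it sets the lower bound on the average number of bits per symbol needed to encode the source.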

For instance, consider the three codewords below \begin{align} C_{1} = [1\,\,1\,\,1\,\,1\,\,1\,\,1\,\,1\,\,1\,\,1\,\,1], \nonumber \\ C_{2} = [1\,\,0\,\,1\,\,0\,\,1\,\,0\,\,1\,\,0\,\,1\,\,0], \nonumber \\ C_{3} = [0\,\,1\,\,0\,\,1\,\,1\,\,1\,\,0\,\,1\,\,0\,\,0]. \nonumber \end{align} Which one requires the fewest bits to encode? Try to describe each of them to the person sitting in front of you. After all, it is a communication problem. Let's begin with the easiest one, \(C_{1}\). You could immediately say "Ten consecutive 1's". That's done. Now try \(C_{2}\): "1 and 0, repeated five times". This is also out of the way, but it took more words than the previous one. Now let's try \(C_{3}\): "0 and 1, repeated two times, then two 1's and a 0, then a 1 and two more 0's". Not a short description, right?

So what is the key difference between these codewords that made you describe them in different numbers of words? You used the leverage of repetition. You spotted the patterns, and you transmitted only a single pattern plus the number of times it is repeated in the codeword. And in the case of \(C_{3}\), when you couldn't describe the codeword in terms of a repetitive pattern, you just described the whole codeword itself, which cost you additional words.
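To make this repetition trick a bit more concrete, here is a minimal sketch (in Python, just for illustration; the function name and the approach are mine, not how a real compressor works) that looks for the shortest repeating pattern in each codeword:

```python
def shortest_repeating_unit(bits):
    """Return the shortest pattern whose repetition reproduces `bits`,
    together with how many times it must be repeated."""
    n = len(bits)
    for length in range(1, n + 1):
        if n % length == 0 and bits[:length] * (n // length) == bits:
            return bits[:length], n // length

codewords = {
    "C1": "1111111111",
    "C2": "1010101010",
    "C3": "0101110100",
}

for name, c in codewords.items():
    unit, reps = shortest_repeating_unit(c)
    # C1 -> '1' repeated 10 times, C2 -> '10' repeated 5 times,
    # C3 -> the whole codeword, repeated once
    print(f"{name}: pattern '{unit}' repeated {reps} time(s)")
```

\(C_{1}\) and \(C_{2}\) collapse into a short pattern plus a repetition count, while \(C_{3}\) can only be "described" by spelling out the whole thing, just like the verbal descriptions above.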

This is the point where most people get confused and associate entropy with disorder, which is a complete fallacy. When we don't see patterns, i.e., order, we feel like there is more uncertainty in what we're looking at. But here comes the trick. Think about how you described \(C_{3}\). You knew it in the first place, right? Just because you can't describe it in terms of patterns doesn't mean that you haven't described it at all. Yes, it will probably take more bits to encode even after proper compression, but still, it's the 8:00 am news for you.

So we need to go back to what uncertainty means. What if I told you that I have a codeword containing 10 bits, half of which are 1's and the other half 0's? Now we have an unknown about the codeword. We don't know the exact locations of the 0's and 1's. We don't know whether the codeword is \(C_{2}\) or \(C_{3}\). They both contain five 1's and five 0's, but their locations are uncertain given the limited information provided to us. All we know is the probability of seeing a 1 in a given bit, i.e., \(p_{1}\), which in this case is equal to 0.5.

Now apply this line of thought to the codeword \(C_{1}\). Would it be hard for you to guess the codeword if I told you that it contains 10 bits and all of them are 1? It's not hard at all, since there is only one possible combination when \(p_{1}=1\). Now we are getting somewhere. In order to define a measure of uncertainty, you need to define a system (a codeword of length 10, with five 1's and five 0's) and have an unknown about it (the locations of the 1's and 0's). The amount of uncertainty arises from how many different realizations you can generate without violating the properties of your system (how many codewords of length 10 containing five 1's and five 0's can you generate?), not from the individual disorder of each realization. It's a kind of measure of freedom of action under certain pre-defined conditions.
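To put a number on that "freedom of action" (a small sketch using the standard counting argument; the variable names are mine): count how many codewords are compatible with what you were told, and take the base-2 logarithm to get the number of bits needed to single one out.

```python
from math import comb, log2

n = 10

# p1 = 0.5: five 1's and five 0's, but their locations are unknown
realizations_half = comb(n, 5)                     # 252 compatible codewords
print(realizations_half, log2(realizations_half))  # 252  ~7.97 bits

# p1 = 1: all ten bits are 1, so only one codeword is possible
realizations_ones = comb(n, n)                     # 1 compatible codeword
print(realizations_ones, log2(realizations_ones))  # 1    0.0 bits
```

252 equally plausible realizations leave you with roughly 8 bits of uncertainty; a single possible realization leaves you with none.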

Then what determines the number of possible realizations? Obviously, the number of values that your unknown parameter can take. Mathematically speaking, as the probability distribution of your unknown parameter gets wider, the amount of uncertainty gets larger.
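For our binary codewords this statement can be sketched with the binary entropy function (a standard formula, not something specific to this example): the closer \(p_{1}\) is to 0.5, the wider the distribution over 0's and 1's, and the larger the entropy per bit.

```python
from math import log2

def binary_entropy(p1):
    """Entropy, in bits, of a single bit that equals 1 with probability p1."""
    if p1 in (0.0, 1.0):
        return 0.0                      # only one possible outcome: no uncertainty
    p0 = 1.0 - p1
    return -(p1 * log2(p1) + p0 * log2(p0))

for p1 in (1.0, 0.9, 0.7, 0.5):
    print(f"p1 = {p1:.1f} -> H = {binary_entropy(p1):.3f} bits")
# p1 = 1.0 -> 0.000, p1 = 0.9 -> 0.469, p1 = 0.7 -> 0.881, p1 = 0.5 -> 1.000
```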

Now, let's link this idea to thermodynamics.
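(The textbook bridge here is Boltzmann's formula, which counts realizations - microstates - exactly the way we just counted codewords: \begin{align} S = k_{B} \ln W, \nonumber \end{align} where \(W\) is the number of microscopic configurations compatible with what we know about the system and \(k_{B}\) is Boltzmann's constant.)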

Consider Figure 1 below, where both systems are identical boxes with six identical marbles inside. The marbles on the left are placed randomly and are completely wedged, meaning they have no room to move. In contrast, the marbles on the right are deliberately and neatly ordered side by side, but since there is free room left in the box, they have a little space to move. So which box seems to be more ordered? Which one seems to have more entropy?
Figure 1: Randomly stacked versus nicely stacked marbles.


You smelled a rat, right?

Stacking the marbles randomly (as in the box on the left) reduces their freedom of motion, leading to a narrower energy distribution. In contrast, thanks to the little room they have, the nicely stacked marbles are free to move, and so have a wider energy distribution. A simple example where higher entropy means more order.

This example is also analogous to gas molecules diffusing in a closed volume. As the molecules diffuse, they hit the walls of the container or hit each other, and they transfer energy. Even if we knew the exact location and temperature of each molecule at the beginning, it becomes impossible to track this information after all the collisions going on in the container. As the molecules chaotically collide, their energy distribution becomes wider and wider. It's not only their random positions that increase the entropy, it's their widening energy distribution1.

Long story short, entropy cannot simply be identified with disorder. Instead, entropy measures how spread out the energy is. Sometimes an orderly-looking system may have energy that is more dispersed than that of a disordered system. Such a system, although seemingly more ordered, would have the higher entropy [1].

References
[1] Peter M. Hoffmann, Life's Ratchet: How Molecular Machines Extract Order from Chaos.

Footnotes
1) When I say energy, I mean the combination of gravitational (potential), kinetic, and thermal energy.