“This duality can be pursued further and is related to a duality between past and future and the notions of control and knowledge. Thus we may have knowledge of the past but cannot control it; we may control the future but have no knowledge of it…” Claude Elwood Shannon
In Part I of “All About Entropy” we delved a little into the standard thermodynamic definition of entropy. Part II deals with a seemingly different type of entropy, introduced by the founder of information theory, Claude Shannon.
Although it may look somewhat disconnected from the previous part, we shall continue our voyage of “All About Entropy…” and connect the dots perfectly through a silk thread.
Again, I am not in any way a renowned expert on the subject (far from it…😉 ), so this post is by no means all-encompassing or highly accurate (you can get that from very in-depth academic resources…), nor is it intended to be…
I therefore make no claim that this blog will benefit everyone; I wrote it mainly to consolidate my own understanding.
I hope for the essence to be captured, and even more so communicated…✨✨✨
We have a basic intuition about how to measure the amount of information in a message. A pink slip, for example, contains a small amount of information, while a book contains a large amount. How do we conclude that? Simply from the length of the message, or in other words, the number of words in a pink slip versus a book.
Now, can we compare messages of different kinds?… like the book above and a John Lennon photograph… The way we can do that is by the use of binary digits, or bits. We can simply count the number of ones and zeros we need to represent a specific message.
Upon examining my HD I find that the John Lennon picture takes about the same number of bits as my T.J. Chung turbulence books: a few megabytes (so maybe a picture is worth a thousand words… 🙌)
But this is not quite accurate. It depends upon the code that we use to express the message. A teletype code and ASCII do not use the same number of bits per letter (5 versus 8), meaning that a thousand-letter message would take 5000 bits in the former and 8000 in the latter. Different codes for pictures can differ from one another by a factor of ten or more in the number of bits they need. Let’s solve this dilemma…
Say we have a source of messages (could be anything you want as long as it produces messages), and we ask how many different messages might this source produce?
Let us denote that source dependent number by M. So now M doesn’t depend on the code chosen, but once a code is chosen, we need M different code words.
M is actually the number of possibilities in a code. The larger M is, the more information a message can convey.
Now let us remember the formula engraved on Ludwig Boltzmann’s tombstone and the log it contains. Suppose now we have two messages, and for each of them there are M possibilities, so the two messages together have M × M possibilities, though our intuition says two messages should carry just twice the information. Log to the rescue: the logarithm of M × M is twice the logarithm of M…
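To see the log doing its job, here is a minimal Python sketch. The function name `hartley_bits` and the value M = 1024 are my own choices for illustration, and the sketch assumes all M messages are equally likely and that we measure in bits (base-2 logarithm).

```python
import math

def hartley_bits(num_messages: int) -> float:
    """Information, in bits, of one choice among equally likely messages."""
    return math.log2(num_messages)

M = 1024  # hypothetical number of possible messages

one_message = hartley_bits(M)        # a single message
two_messages = hartley_bits(M * M)   # the possibilities multiply...

# ...while the log-based measure simply doubles.
print(one_message, two_messages)
```

The possibilities go from M to M × M, but the measure goes from log M to 2 log M, which is exactly the "twice the information" our intuition asked for.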
Logarithms were introduced to our cosmos by the Scottish mathematician John Napier (1550-1617),
and although we honor him deeply for that, we will not go over log rules in an “All About CFD…” post. Now, finally, Entropy enters the picture.
According to Claude Shannon (1916-2001), the amount of information contained in a message from the source is measured by its entropy, denoted H (the H itself was actually proposed earlier by Ralph Hartley, a Bell Labs engineer, whose work is certainly a precursor to Shannon’s). It is defined as follows (…and note that it is the choice of the base-2 logarithm that makes our entropy measured in bits):
$$H = \log_2(\text{number of possible messages})$$
or, in other words, using our convention:
$$H = \log_2 M$$
When M=1 the entropy is H=0. It sounds obvious, but it is a hint at the definition of entropy soon to come…
- H could be interpreted as the amount of information we get upon receiving a message.
- H is the amount of information we lack before receiving the message, or in other words, a measure of our uncertainty about the message.
The second definition is the more interesting one, as it relates to passwords and code breaking. If an anonymous bad person wanted to log in to my LinkedIn account, he would have to guess my password. Say LinkedIn passwords consisted of only four digits. Then there are 10,000 passwords he would have to try to be sure of breaking into my account. A robot could test that many possibilities in a very short time.
The entropy of a password is then a good measure of its level of security (setting aside other measures, like the number of mistaken attempts allowed…). In other words, the entropy of a password measures how much the anonymous bad person does not know and has to guess.
The entropy of the first digit in the code is:
$$H = \log_2 10 \approx 3.32 \text{ bits}$$
Therefore a four-digit password has four times that entropy, which means 13.3 bits.
Suppose instead my password were an eight-letter word; then the entropy of each letter would be:
$$H = \log_2 26 \approx 4.70 \text{ bits}$$
And therefore an eight-letter password would have eight times this entropy, which is 37.6 bits.
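The two numbers above can be reproduced with a few lines of Python. This is only a sketch: the function name `password_entropy_bits` is mine, and it assumes each symbol is chosen independently and uniformly at random, with 10 possible digits and 26 possible letters (lowercase only).

```python
import math

def password_entropy_bits(alphabet_size: int, length: int) -> float:
    """Entropy in bits of a password whose symbols are independent and uniform."""
    return length * math.log2(alphabet_size)

four_digits = password_entropy_bits(alphabet_size=10, length=4)
eight_letters = password_entropy_bits(alphabet_size=26, length=8)

print(f"four digits:   {four_digits:.1f} bits")
print(f"eight letters: {eight_letters:.1f} bits")
```

Rounded to one decimal place this gives 13.3 and 37.6 bits, matching the figures in the text.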
If it took a robot one second to go over all the 4-digit passwords, it would take it roughly eight months to work through all the options for 8-letter passwords. Well, maybe not, because some of us are careless enough to choose an eight-letter word we are sure to remember, which lowers the entropy considerably… So if you want a secure password, the whole idea is to increase the number of available symbols, require combining them, and make the choice as random as possible. But although coding is very much related to our subject, we’re going to leave it out of this post…
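Here is a rough back-of-the-envelope check of the robot’s workload, as a Python sketch. The function name and the guessing rate are my own assumptions: I take the robot to test all 10,000 four-digit codes in exactly one second, and to try every one of the 26^8 eight-letter combinations at that same rate.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def exhaustive_search_seconds(alphabet_size: int, length: int,
                              guesses_per_second: float) -> float:
    """Time to try every possible password at a fixed guessing rate."""
    return alphabet_size ** length / guesses_per_second

# Assumed rate: all 10**4 four-digit passwords in one second.
rate = 10 ** 4 / 1.0

days = exhaustive_search_seconds(26, 8, rate) / SECONDS_PER_DAY
print(f"about {days:.0f} days to try every eight-letter password")
```

Under these assumptions the exhaustive search takes a bit over 240 days. A real attacker would try dictionary words first, which is exactly why a memorable word has far less entropy than eight random letters.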
We are going to list more precisely the properties of the Hartley-Shannon Entropy, though:
- A measure of the message that is independent of the message representation.
- The entropy of a pair of messages is the sum of the entropies of the individual messages.
- The measure of information is related to the message length.
A nice way to illustrate entropy is to think of bits, zeros and ones, as answers to yes/no questions. Now a riddle…
Suppose you have 1,000,000 jars, and I place a coin in one of them. What is the minimum number of yes/no questions you would need to ask to be sure of finding where the coin is, had you left the room before I chose the jar? (Answer: 20… Why?)
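One way to see where the 20 comes from: each yes/no answer is one bit, and a good question cuts the remaining jars in half. The Python sketch below plays that halving game. The function name `questions_to_find` and the jar numbering (0 to 999,999) are my own conventions, and halving is just one possible strategy that reaches the minimum.

```python
import math

def questions_to_find(coin_jar: int, num_jars: int) -> int:
    """Count the yes/no questions a halving strategy asks to locate the coin."""
    low, high = 0, num_jars          # the coin is somewhere in jars [low, high)
    questions = 0
    while high - low > 1:
        mid = (low + high) // 2
        questions += 1               # ask: "is the coin in a jar numbered below mid?"
        if coin_jar < mid:
            high = mid               # yes: keep the lower half
        else:
            low = mid                # no: keep the upper half
    return questions

NUM_JARS = 1_000_000
worst_case = math.ceil(math.log2(NUM_JARS))   # log2(1,000,000) is about 19.93

print(worst_case, questions_to_find(NUM_JARS - 1, NUM_JARS))
```

Since 2^19 = 524,288 is less than a million and 2^20 = 1,048,576 is more, 19 answers cannot distinguish all the jars but 20 always can.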
Shannon’s Probability Principle
It would be enough to read this post to realize that the frequency with which letters appear in the English language is non-uniform; specifically, it looks like this:
Suppose now we want to build a tsunami alarm system. We know that the probability of having a tsunami is almost zero, and the probability of one not occurring is almost one:
$$P(\text{tsunami}) \approx 0, \qquad P(\text{no tsunami}) \approx 1$$
Whether you are a “Bayesian” or a “frequentist” (this kind of event doesn’t happen often enough for the two to feel in agreement with each other…), it is obvious that in the low-probability scenario, where the alarm sounds, we would be far more surprised. Our lives could change forever… The more surprising a message is, the more it tells us. So to actually measure the information content of a message, we need a measure of surprise.
We shall call the surprise from a message m: S(m), and we know that this surprise depends on the likelihood of the message (no alarm with high likelihood – no surprise…).
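In information theory, the surprise measure is based on the logarithm of the likelihood P(m) of the message, in the following way:
$$S(m) = -\log_2 P(m)$$
Just as we expect, a message that is certain carries no surprise, while a very unlikely one carries a lot:
$$P(m) = 1 \Rightarrow S(m) = 0, \qquad P(m) \to 0 \Rightarrow S(m) \to \infty$$
Recall that we defined M for events that were equally likely, so each message has probability 1/M and the surprise of such a message would be:
$$S(m) = -\log_2 \frac{1}{M} = \log_2 M = H$$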
So now we see that surprise, measured in bits, is the amount of information in a message. The less likely a message is, the more information it conveys. That’s what lies at the heart of the Hartley-Shannon entropy definition.
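A small Python sketch of the surprise measure, assuming the standard definition S(m) = -log2 P(m), with the surprise measured in bits. The function name `surprise_bits` and the tsunami probability of 0.001 are invented for illustration only.

```python
import math

def surprise_bits(probability: float) -> float:
    """Surprise, in bits, of a message with the given probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return math.log2(1.0 / probability)   # same as -log2(probability)

p_tsunami = 0.001                  # hypothetical: the alarm almost never sounds
p_no_tsunami = 1.0 - p_tsunami

print(f"alarm sounds: {surprise_bits(p_tsunami):.2f} bits")
print(f"alarm silent: {surprise_bits(p_no_tsunami):.4f} bits")

# With M equally likely messages, each has probability 1/M,
# and the surprise reduces to log2(M).
M = 1024
print(surprise_bits(1 / M), math.log2(M))
```

The silent alarm tells us almost nothing, the sounding alarm tells us a great deal, and for equally likely messages the surprise coincides with the entropy H from earlier in the post.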
Yet we still need a measure for entropy, the overall amount of information in the source. So we need to figure out how the surprise we find in a message is related to the entropy of the source. Here comes Claude Shannon to the aid:
Entropy is the average surprise
So we need some means of taking averages. Suppose a variable x ranges over the possible events, and F(x) is a numerical value attached to each event. For example, when we roll a die, x is the outcome of the roll and F(x) is the number of dots showing (a number between 1 and 6). The average is then the sum, over all events, of the probability of each event times its numerical value:
$$\langle F \rangle = \sum_x P(x)\, F(x)$$
(So for the die the average would be 3.5, even though we’d never expect a result of 3.5 on a die roll… ).
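The die average is easy to check in Python. This is a minimal sketch assuming a fair die (each face with probability 1/6); the helper name `average` and the use of exact fractions are my own choices.

```python
from fractions import Fraction

def average(distribution, value):
    """Sum of probability times numerical value over all events."""
    return sum(p * value(x) for x, p in distribution.items())

# Assumed: a fair die, every face equally likely.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

mean_roll = average(fair_die, lambda face: face)   # F(x) = number of dots
print(mean_roll, float(mean_roll))
```

Using `Fraction` keeps the arithmetic exact, so the result is 7/2 with no floating-point rounding involved.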
And the Average Surprise is:
$$H = \sum_m P(m)\, S(m) = -\sum_m P(m) \log_2 P(m)$$
This is Shannon’s formula for the entropy: the average surprise.
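To make the average surprise concrete, here is a short Python sketch of Shannon’s formula, H = -Σ P(m) log2 P(m), in bits. The function name `shannon_entropy_bits` and the example probabilities (a fair die, and a tsunami alarm that sounds with probability 0.001) are my own assumptions for illustration.

```python
import math

def shannon_entropy_bits(probabilities) -> float:
    """Average surprise, in bits, of a source with the given message probabilities."""
    # Messages with zero probability never occur and contribute nothing.
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)

fair_die = [1 / 6] * 6             # six equally likely messages
tsunami_alarm = [0.001, 0.999]     # hypothetical: alarm sounds / stays silent

print(f"fair die:      {shannon_entropy_bits(fair_die):.3f} bits")
print(f"tsunami alarm: {shannon_entropy_bits(tsunami_alarm):.3f} bits")
```

For equally likely messages the result is just log2 M, the Hartley entropy from earlier. The tsunami alarm, on the other hand, has a very low entropy: the rare message is very surprising, but it is so rare that the average surprise stays small.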
In short, Entropy has a HUGE meaning for coding, and Shannon’s formula made it possible to go far beyond his predecessors as far as code breaking is concerned.
This ends Part II of “All About Entropy…”. In Parts III and IV we shall delve into some more exotic topics and also weave our shining silk thread to gain a holistic view of the beauty of entropy.