Information Theory
How many bits does it take to communicate surprise? Claude Shannon solved this in 1948 and invented the digital age.
01 — The Fundamental Question
Before Shannon, "information" was a vague concept — knowledge, facts, data. Shannon made it precise and measurable. His insight: information is surprise. The more unexpected an event, the more information it carries.
Consider two messages:
"The sun rose this morning."
"A meteor just struck Paris."
Both are grammatically similar, but the second message carries far more information. Why? Because it's unexpected. The sun rises every day — learning that it happened again tells you almost nothing new. A meteor strike is rare and changes your model of the world dramatically.
Shannon measured information in bits — binary digits. One bit is the amount of information gained from learning the outcome of a fair coin flip. Two equally likely outcomes = 1 bit.
Four equally likely outcomes (two coin flips) = 2 bits. Eight outcomes = 3 bits. The pattern: if there are N equally likely outcomes, you need log₂(N) bits to specify which one occurred.
Examples:
A fair coin flip: log₂(2) = 1 bit.
A fair six-sided die: log₂(6) ≈ 2.58 bits.
One letter drawn uniformly from a 26-letter alphabet: log₂(26) ≈ 4.70 bits.
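The log₂(N) rule is a one-liner in code. A quick sketch (the function name `bits_needed` is my own):

```python
import math

def bits_needed(n_outcomes: int) -> float:
    """Information (in bits) gained by learning which of
    n equally likely outcomes occurred: log2(n)."""
    return math.log2(n_outcomes)

print(bits_needed(2))             # fair coin flip -> 1.0
print(bits_needed(4))             # two coin flips -> 2.0
print(bits_needed(8))             # three coin flips -> 3.0
print(round(bits_needed(26), 2))  # one uniform letter -> 4.7
```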
02 — Average Information
Individual events have information content. But what about a random variable — a whole probability distribution? Shannon defined its entropy, H(X) = −Σₓ p(x) log₂ p(x), as the average information you'd expect to receive per symbol.
Think of entropy as "how hard it is to guess the next symbol." A perfectly predictable sequence (AAAAAAA...) has entropy = 0. A maximally random sequence (uniform distribution) has maximum entropy.
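Both extremes fall out of the formula H = −Σ p log₂ p directly. A minimal sketch (the `entropy` helper is my own naming):

```python
import math

def entropy(probs) -> float:
    """Shannon entropy H = -sum(p * log2(p)), in bits per symbol.
    Terms with p = 0 contribute nothing, so they are skipped."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0]))                 # perfectly predictable -> 0.0
print(entropy([0.5, 0.5]))            # fair coin -> 1.0
print(entropy([0.25] * 4))            # uniform over 4 symbols -> 2.0
print(round(entropy([0.9, 0.1]), 3))  # biased coin -> ~0.469 bits
```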
Why does English text carry far less information per letter than log₂(26) ≈ 4.7 bits? Because letters aren't equally likely, and they're not independent. 'E' appears 12.7% of the time, 'Z' only 0.07%. After 'Q', the next letter is almost always 'U'. This redundancy is why compression works — and why autocomplete works.
03 — Interactive Demo
Type or paste any text. The calculator computes character frequencies and entropy — how many bits per character would be needed in an optimal encoding.
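Such a calculator can be sketched in a few lines (this is zeroth-order entropy only — it counts single-character frequencies and ignores context; `text_entropy` is my own name):

```python
import math
from collections import Counter

def text_entropy(text: str) -> float:
    """Bits per character under an optimal code for this text's
    character frequencies. log2(n/c) = -log2(c/n), so every term
    is non-negative."""
    counts = Counter(text)
    n = len(text)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

print(text_entropy("AAAAAAA"))                # one symbol -> 0.0
print(round(text_entropy("MISSISSIPPI"), 3))  # -> 1.823 bits/char
```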
04 — Optimal Encoding
If some symbols are more frequent than others, we can compress by assigning shorter codes to frequent symbols and longer codes to rare ones. Huffman coding is the optimal way to do this.
Build a binary tree bottom-up by repeatedly merging the two least-frequent symbols. The path from root to leaf gives the code: left = 0, right = 1.
Example: the string "MISSISSIPPI" has frequencies M:1, I:4, S:4, P:2. Huffman assigns variable-length codes.
Fixed-length code: M=00, I=01, S=10, P=11
Total: 11 chars × 2 bits = 22 bits
Huffman code: I=0, S=10, P=110, M=111
Total: 4×1 + 4×2 + 2×3 + 1×3 = 21 bits
Huffman comes within one bit per symbol of the entropy limit: average code length ≈ entropy. This is why ZIP, PNG, and MP3 use variants of Huffman coding.
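The bottom-up merge can be sketched with a heap (the function name and tree representation are my own; note that tie-breaking between equal frequencies can produce a different but equally optimal assignment of code lengths — the total is always the same):

```python
import heapq
from collections import Counter

def huffman_code_lengths(text: str) -> dict:
    """Build a Huffman tree by repeatedly merging the two
    least-frequent subtrees; return each symbol's code length."""
    freq = Counter(text)
    # Heap entries: (frequency, tiebreak id, {symbol: depth so far}).
    heap = [(f, i, {sym: 0}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging pushes every symbol in both subtrees one level deeper.
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

lengths = huffman_code_lengths("MISSISSIPPI")
total = sum(lengths[ch] for ch in "MISSISSIPPI")
print(total)  # 21 bits, vs. 22 for a fixed 2-bit code
```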
05 — Noisy Communication
Real communication channels are noisy — bits get flipped, signals get corrupted. Shannon asked: what is the maximum rate you can reliably transmit information over a noisy channel?
The answer is the channel capacity, measured in bits per second (or bits per channel use). Surprisingly, as long as you transmit at any rate below capacity, you can make the error rate arbitrarily small with clever error-correcting codes.
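As a concrete instance, the binary symmetric channel — each bit flipped independently with probability p — has capacity C = 1 − H(p), where H is the binary entropy function. A sketch (function names are mine):

```python
import math

def h2(p: float) -> float:
    """Binary entropy function H(p), in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p: float) -> float:
    """Capacity of a binary symmetric channel with flip
    probability p: C = 1 - H(p) bits per channel use."""
    return 1 - h2(p)

print(bsc_capacity(0.0))             # noiseless channel -> 1.0 bit per use
print(round(bsc_capacity(0.11), 3))  # 11% flips -> ~0.5 bits per use
print(bsc_capacity(0.5))             # pure noise -> 0.0
```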
06 — Where It Lives
Compression: ZIP, GZIP, PNG, JPEG, MP3, H.264 — all use entropy coding (Huffman or arithmetic coding) to approach Shannon's limit. You cannot compress random data. You can only compress data with patterns (low entropy).
Machine learning: Cross-entropy loss is the standard loss function for classification. Minimizing cross-entropy = maximizing likelihood. Mutual information measures how much one variable tells you about another — used in feature selection.
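A minimal illustration of cross-entropy as a loss, measured in bits to match the rest of this page (ML frameworks typically use the natural log, which differs only by a constant factor):

```python
import math

def cross_entropy(p, q) -> float:
    """H(p, q) = -sum(p * log2(q)): average bits needed to encode
    samples from the true distribution p using a code built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

true = [1.0, 0.0, 0.0]       # one-hot true label
good = [0.9, 0.05, 0.05]     # confident, correct prediction
bad  = [0.1, 0.45, 0.45]     # confident, wrong prediction

print(round(cross_entropy(true, good), 3))  # low loss  -> 0.152
print(round(cross_entropy(true, bad), 3))   # high loss -> 3.322
```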
Cryptography: Perfect encryption produces ciphertext with maximum entropy — indistinguishable from random noise. Shannon proved the one-time pad is perfectly secure (and also proved it requires a key as long as the message).
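A one-time pad is just XOR with a uniformly random key as long as the message; a sketch using Python's `secrets` module:

```python
import secrets

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

message = b"ATTACK AT DAWN"
key = secrets.token_bytes(len(message))  # random key, as long as the message
ciphertext = xor_bytes(message, key)     # maximum entropy: looks like noise
recovered = xor_bytes(ciphertext, key)   # XOR with the same key decrypts
print(recovered)                         # b'ATTACK AT DAWN'
```

Reusing the key for a second message breaks the security proof — hence "one-time."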
Neuroscience: How much information does a spike train carry about a stimulus? Information theory quantifies neural efficiency — bits per spike, bits per joule.
Physics: Boltzmann's entropy (S = k log W) and Shannon's entropy have the same mathematical form. Landauer's principle: erasing 1 bit of information costs at least kT ln(2) energy — information is physical.
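Plugging numbers into Landauer's bound at room temperature shows how tiny the minimum cost per erased bit is:

```python
import math

# Landauer's principle: erasing one bit dissipates at least kT ln(2).
k_B = 1.380649e-23   # Boltzmann constant, J/K (exact in SI since 2019)
T = 300.0            # room temperature, kelvin

landauer_limit = k_B * T * math.log(2)  # joules per erased bit
print(landauer_limit)                   # ~2.87e-21 J
```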