Information Theory
How many bits does it take to communicate surprise? Claude Shannon solved this in 1948 and invented the digital age.
01 — The Fundamental Question
Before Shannon, "information" was a vague concept — knowledge, facts, data. Shannon made it precise and measurable. His insight: information is surprise. The more unexpected an event, the more information it carries.
Consider two messages:
"The sun rose this morning."
"A meteor just struck Paris."
Both are grammatically similar, but the second message carries far more information. Why? Because it's unexpected. The sun rises every day — learning that it happened again tells you almost nothing new. A meteor strike is rare and changes your model of the world dramatically.
Shannon measured information in bits — binary digits. One bit is the amount of information gained from learning the outcome of a fair coin flip. Two equally likely outcomes = 1 bit.
Four equally likely outcomes (two coin flips) = 2 bits. Eight outcomes = 3 bits. The pattern: if there are N equally likely outcomes, you need log₂(N) bits to specify which one occurred.
Examples:
A fair coin flip: log₂(2) = 1 bit.
A fair six-sided die: log₂(6) ≈ 2.58 bits.
One letter drawn uniformly from a 26-letter alphabet: log₂(26) ≈ 4.70 bits.
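The log₂(N) rule is a one-liner in code. A quick sketch (the function name `bits_needed` is my own):

```python
import math

def bits_needed(n_outcomes: int) -> float:
    """Information (in bits) gained by learning which of
    n equally likely outcomes occurred: log2(n)."""
    return math.log2(n_outcomes)

print(bits_needed(2))             # fair coin flip -> 1.0
print(bits_needed(4))             # two coin flips -> 2.0
print(bits_needed(8))             # three coin flips -> 3.0
print(round(bits_needed(26), 2))  # one uniform letter -> 4.7
```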
02 — Average Information
Individual events have information content. But what about a random variable — a whole probability distribution? Shannon defined its entropy, H(X) = −Σₓ p(x) log₂ p(x), as the average information you'd expect to receive per symbol.
Think of entropy as "how hard it is to guess the next symbol." A perfectly predictable sequence (AAAAAAA...) has entropy = 0. A maximally random sequence (uniform distribution) has maximum entropy.
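Both extremes fall out of the formula H = −Σ p log₂ p directly. A minimal sketch (the `entropy` helper is my own naming):

```python
import math

def entropy(probs) -> float:
    """Shannon entropy H = -sum(p * log2(p)), in bits per symbol.
    Terms with p = 0 contribute nothing, so they are skipped."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0]))                 # perfectly predictable -> 0.0
print(entropy([0.5, 0.5]))            # fair coin -> 1.0
print(entropy([0.25] * 4))            # uniform over 4 symbols -> 2.0
print(round(entropy([0.9, 0.1]), 3))  # biased coin -> ~0.469 bits
```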
Why does English text carry far less information per letter than log₂(26) ≈ 4.7 bits? Because letters aren't equally likely, and they're not independent. 'E' appears 12.7% of the time, 'Z' only 0.07%. After 'Q', the next letter is almost always 'U'. This redundancy is why compression works — and why autocomplete works.
03 — Interactive Demo
Type or paste any text. The calculator computes character frequencies and entropy — how many bits per character would be needed in an optimal encoding.
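Such a calculator can be sketched in a few lines (this is zeroth-order entropy only — it counts single-character frequencies and ignores context; `text_entropy` is my own name):

```python
import math
from collections import Counter

def text_entropy(text: str) -> float:
    """Bits per character under an optimal code for this text's
    character frequencies. log2(n/c) = -log2(c/n), so every term
    is non-negative."""
    counts = Counter(text)
    n = len(text)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

print(text_entropy("AAAAAAA"))                # one symbol -> 0.0
print(round(text_entropy("MISSISSIPPI"), 3))  # -> 1.823 bits/char
```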
04 — Optimal Encoding
If some symbols are more frequent than others, we can compress by assigning shorter codes to frequent symbols and longer codes to rare ones. Huffman coding is the optimal way to do this.
Build a binary tree bottom-up by repeatedly merging the two least-frequent symbols. The path from root to leaf gives the code: left = 0, right = 1.
Example: the string "MISSISSIPPI" has frequencies M:1, I:4, S:4, P:2. Huffman assigns variable-length codes.
Fixed-length code: M=00, I=01, S=10, P=11
Total: 11 chars × 2 bits = 22 bits
Huffman code: I=0, S=10, P=110, M=111
Total: 4×1 + 4×2 + 2×3 + 1×3 = 21 bits
Huffman comes within one bit per symbol of the entropy limit: average code length ≈ entropy. This is why ZIP, PNG, and MP3 use variants of Huffman coding.
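The bottom-up merge can be sketched with a heap (the function name and tree representation are my own; note that tie-breaking between equal frequencies can produce a different but equally optimal assignment of code lengths — the total is always the same):

```python
import heapq
from collections import Counter

def huffman_code_lengths(text: str) -> dict:
    """Build a Huffman tree by repeatedly merging the two
    least-frequent subtrees; return each symbol's code length."""
    freq = Counter(text)
    # Heap entries: (frequency, tiebreak id, {symbol: depth so far}).
    heap = [(f, i, {sym: 0}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging pushes every symbol in both subtrees one level deeper.
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

lengths = huffman_code_lengths("MISSISSIPPI")
total = sum(lengths[ch] for ch in "MISSISSIPPI")
print(total)  # 21 bits, vs. 22 for a fixed 2-bit code
```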
05 — Noisy Communication
Real communication channels are noisy — bits get flipped, signals get corrupted. Shannon asked: what is the maximum rate you can reliably transmit information over a noisy channel?
The answer is the channel capacity, measured in bits per second (or bits per channel use). Surprisingly, as long as you transmit at any rate below capacity, you can make the error rate arbitrarily small with clever error-correcting codes.
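As a concrete instance, the binary symmetric channel — each bit flipped independently with probability p — has capacity C = 1 − H(p), where H is the binary entropy function. A sketch (function names are mine):

```python
import math

def h2(p: float) -> float:
    """Binary entropy function H(p), in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p: float) -> float:
    """Capacity of a binary symmetric channel with flip
    probability p: C = 1 - H(p) bits per channel use."""
    return 1 - h2(p)

print(bsc_capacity(0.0))             # noiseless channel -> 1.0 bit per use
print(round(bsc_capacity(0.11), 3))  # 11% flips -> ~0.5 bits per use
print(bsc_capacity(0.5))             # pure noise -> 0.0
```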
06 — Where It Lives
Compression: ZIP, GZIP, PNG, JPEG, MP3, H.264 — all use entropy coding (Huffman or arithmetic coding) to approach Shannon's limit. You cannot compress random data. You can only compress data with patterns (low entropy).
Machine learning: Cross-entropy loss is the standard loss function for classification. Minimizing cross-entropy = maximizing likelihood. Mutual information measures how much one variable tells you about another — used in feature selection.
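A minimal illustration of cross-entropy as a loss, measured in bits to match the rest of this page (ML frameworks typically use the natural log, which differs only by a constant factor):

```python
import math

def cross_entropy(p, q) -> float:
    """H(p, q) = -sum(p * log2(q)): average bits needed to encode
    samples from the true distribution p using a code built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

true = [1.0, 0.0, 0.0]       # one-hot true label
good = [0.9, 0.05, 0.05]     # confident, correct prediction
bad  = [0.1, 0.45, 0.45]     # confident, wrong prediction

print(round(cross_entropy(true, good), 3))  # low loss  -> 0.152
print(round(cross_entropy(true, bad), 3))   # high loss -> 3.322
```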
Cryptography: Perfect encryption produces ciphertext with maximum entropy — indistinguishable from random noise. Shannon proved the one-time pad is perfectly secure (and also proved it requires a key as long as the message).
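A one-time pad is just XOR with a uniformly random key as long as the message; a sketch using Python's `secrets` module:

```python
import secrets

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

message = b"ATTACK AT DAWN"
key = secrets.token_bytes(len(message))  # random key, as long as the message
ciphertext = xor_bytes(message, key)     # maximum entropy: looks like noise
recovered = xor_bytes(ciphertext, key)   # XOR with the same key decrypts
print(recovered)                         # b'ATTACK AT DAWN'
```

Reusing the key for a second message breaks the security proof — hence "one-time."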
Neuroscience: How much information does a spike train carry about a stimulus? Information theory quantifies neural efficiency — bits per spike, bits per joule.
Physics: Boltzmann's entropy (S = k log W) and Shannon's entropy have the same mathematical form. Landauer's principle: erasing 1 bit of information costs at least kT ln(2) energy — information is physical.
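Plugging numbers into Landauer's bound at room temperature shows how tiny the minimum cost per erased bit is:

```python
import math

# Landauer's principle: erasing one bit dissipates at least kT ln(2).
k_B = 1.380649e-23   # Boltzmann constant, J/K (exact in SI since 2019)
T = 300.0            # room temperature, kelvin

landauer_limit = k_B * T * math.log(2)  # joules per erased bit
print(landauer_limit)                   # ~2.87e-21 J
```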