Existence of Code
A source code $C$ for a random variable $X$ is a mapping from $\mathcal{X}$ to $\mathcal{D}^*$, the set of finite-length strings over a $D$-ary alphabet
- $\mathcal{X}$: the range of $X$
- $D$-ary alphabet: $\mathcal{D}=\{0,1,\cdots,D-1\}$
- $C(x)$: codeword for $x$
- $l(x)$: length of $C(x)$
- $L(C)=\sum_{x\in\mathcal{X}}p(x)l(x)$: the expected length of the source code $C$ for a random variable $X$ with pmf $p(x)$
- rate $R$: average length
Nonsingular code
- every element of $\mathcal{X}$ maps into a different string in $\mathcal{D}^*$
- $C^*$: the extension of a code $C$ is the mapping from finite-length strings of $\mathcal{X}$ to finite-length strings of $\mathcal{D}$, $C^*(x_1x_2\cdots x_n)=C(x_1)C(x_2)\cdots C(x_n)$
- Uniquely decodable: the extension $C^*$ is nonsingular
Prefix Code (instantaneous code)
- no codeword is a prefix of any other codeword
- Tree representation
- Kraft Inequality (1949): For any instantaneous code over an alphabet of size $D$, the codeword lengths $l_1,l_2,\cdots,l_m$ must satisfy the inequality $\sum_{i=1}^mD^{-l_i}\leq 1$
- Conversely, given a set of codeword lengths satisfying this inequality, there exists a prefix code with these lengths (see the sketch after this list)
- Interval representation
- Extended Kraft Inequality: the same inequality $\sum_{i=1}^\infty D^{-l_i}\leq 1$ and its converse hold for any countably infinite set of codewords
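A minimal sketch of both directions in Python (function names are illustrative, not from the notes): check the Kraft sum for a list of lengths, and greedily build a $D$-ary prefix code when the inequality holds.

```python
def kraft_sum(lengths, D=2):
    """Sum of D^{-l_i}; a D-ary prefix code with these lengths exists iff <= 1."""
    return sum(D ** -l for l in lengths)

def prefix_code_from_lengths(lengths, D=2):
    """Greedy converse construction (assumes D <= 10 for one-char digits)."""
    assert kraft_sum(lengths, D) <= 1, "Kraft inequality violated"
    code, val, prev = [], 0, None
    for l in sorted(lengths):                  # assign shortest codewords first
        if prev is not None:
            val = (val + 1) * D ** (l - prev)  # next unused node at depth l
        prev = l
        word, v = "", val
        for _ in range(l):                     # write val as l base-D digits
            word = str(v % D) + word
            v //= D
        code.append(word)
    return code

print(prefix_code_from_lengths([2, 1, 3, 3]))  # ['0', '10', '110', '111']
```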
Optimal Codes
- optimal: $\sum p_il_i$ is minimal
- optimization problem (Lagrange multipliers): $$\min_{l_i} L=\sum p_il_i\quad\text{subject to}\quad\sum D^{-l_i}\leq 1$$
- at the minimum, $p_i=D^{-l_i}$, i.e., $l_i^*=-\log_D p_i$, hence $L^*\geq H_D(X)$
- Bounds: $H_D(X)\leq L^*<H_D(X)+1$
- Shannon codes: $l_i=\lceil\log_D\frac{1}{p_i}\rceil$
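A quick numeric check of these bounds with a made-up pmf (binary alphabet assumed):

```python
import math

def shannon_lengths(p, D=2):
    """Shannon code lengths l_i = ceil(log_D 1/p_i)."""
    return [math.ceil(math.log(1 / pi, D)) for pi in p]

p = [0.4, 0.3, 0.2, 0.1]                    # made-up distribution
ls = shannon_lengths(p)                     # [2, 2, 3, 4]
H = -sum(pi * math.log2(pi) for pi in p)    # H_2(X) ≈ 1.846
L = sum(pi * li for pi, li in zip(p, ls))   # expected length = 2.4
assert H <= L < H + 1
```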
Encode $n$ symbols together (block coding amortizes the $+1$ in $H_D(X)+1$)
- $\frac{H(X_1,X_2,\cdots,X_n)}{n}\leq L_n^*<\frac{H(X_1,X_2,\cdots,X_n)}{n}+\frac{1}{n}$; for a stationary stochastic process, $L_n^*\rightarrow H(\mathcal{X})$, the entropy rate (see the sketch below)
- shortcoming: the block alphabet has $|\mathcal{X}|^n$ symbols, so the code table grows exponentially in $n$
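An illustration under an i.i.d. assumption (distribution made up): Shannon-coding blocks of $n$ symbols pushes the per-symbol rate toward $H(X)$ at rate $1/n$, while the number of table entries $|\mathcal{X}|^n$ grows exponentially.

```python
import itertools, math

def block_rate(p, n):
    """Per-symbol expected Shannon-code length over blocks of n i.i.d. symbols."""
    L = 0.0
    for block in itertools.product(range(len(p)), repeat=n):
        q = math.prod(p[s] for s in block)        # probability of this block
        L += q * math.ceil(math.log2(1 / q))      # Shannon length for the block
    return L / n

p = [0.9, 0.1]                                    # H(X) ≈ 0.469 bits
for n in (1, 2, 4, 8):
    print(n, len(p) ** n, round(block_rate(p, n), 4))   # table size vs. rate
```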
Wrong code
- The expected length under $p(x)$ of the code assignment $l(x)=\lceil\log\frac{1}{q(x)}\rceil$: $H(p)+D(p\|q)\leq E_pl(X)<H(p)+D(p\|q)+1$ (checked numerically below)
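A numeric check of this bound with made-up $p$ and $q$ (here $q$ is dyadic, so the lower bound is met with equality):

```python
import math

p = [0.5, 0.25, 0.25]                  # true distribution
q = [0.25, 0.25, 0.5]                  # mistakenly assumed distribution
l = [math.ceil(math.log2(1 / qi)) for qi in q]               # lengths for q
E_l = sum(pi * li for pi, li in zip(p, l))                   # E_p l(X) = 1.75
H_p = -sum(pi * math.log2(pi) for pi in p)                   # H(p) = 1.5
D_pq = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))  # D(p||q) = 0.25
assert H_p + D_pq <= E_l < H_p + D_pq + 1
```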
Kraft Inequality for Uniquely Decodable Codes (McMillan): The codeword lengths of any uniquely decodable $D$-ary code must satisfy the Kraft inequality $\sum_i D^{-l_i}\leq 1$. Conversely, given a set of codeword lengths satisfying this inequality, it is possible to construct a uniquely decodable code with these codeword lengths. (so a prefix code loses nothing relative to general uniquely decodable codes)
Codes
Huffman Codes
- $D$-ary Huffman codes (prefix codes) for a given distribution: each time, combine the $D$ symbols with the lowest probabilities into a single source symbol, until only one symbol remains (see the sketch below)
- add dummy symbols of probability 0 so that the total number of symbols has the form $1+k(D-1)$
- optimal: if $C'$ is any other uniquely decodable code, then $L(C)\leq L(C')$
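A minimal binary ($D=2$) Huffman coder sketched with Python's heapq; the counter exists only to keep heap entries comparable and is not part of the algorithm.

```python
import heapq

def huffman_code(probs):
    """probs: dict symbol -> probability; returns dict symbol -> codeword."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)        # two lowest-probability subtrees
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, count, merged))
        count += 1                             # tie-breaker; dicts never compared
    return heap[0][2]

print(huffman_code({"a": 0.45, "b": 0.25, "c": 0.15, "d": 0.15}))
# {'a': '0', 'b': '10', 'c': '110', 'd': '111'} (the lengths are what matter)
```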
Shannon Codes
- $D$-adic distribution: $P(X=x_i)=D^{-n_i}$ for integers $n_i$, so $l_i=\log_D\frac{1}{p_i}=n_i$
- Shannon codes
- attain the optimal expected length to within 1 bit
- $D$-adic: Shannon codes are optimal
- $p_i\rightarrow0$: Shannon codes may be much worse than optimal (sketch below)
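A toy comparison with made-up probabilities: the rare symbol gets a 14-bit Shannon codeword even though lengths $(1,1)$ suffice for two symbols.

```python
import math

p = [0.9999, 0.0001]
shannon = [math.ceil(math.log2(1 / pi)) for pi in p]
print(shannon)   # [1, 14]; an optimal (Huffman) code uses lengths [1, 1]
```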
For any optimal coding scheme
- lengths are ordered inversely with probability (if $p_j>p_k$, then $l_j\leq l_k$)
- the two longest codewords have the same length
- two of the longest codewords differ only in the last bit
Shannon-Fano-Elias Coding
- $F(x)=\sum_{a\leq x}p(a)$
- modified cumulative distribution function: $\overline F(x)=\sum_{a<x}p(a)+\frac{1}{2}p(x)=F(x)+\frac{1}{2}p(x)$
- Truncate $\overline{F}(x)$ to $l(x)$ bits: $\overline F(x)-\lfloor\overline{F}(x)\rfloor_{l(x)}<\frac{1}{2^{l(x)}}$
- with $l(x)=\lceil\log\frac{1}{p(x)}\rceil+1$: $\frac{1}{2^{l(x)}}\leq\frac{p(x)}{2}=\overline{F}(x)-F(x-1)$, so the truncated value still lies inside the interval for $x$ and the code is prefix-free (sketch below)
- $L=\sum p(x)l(x)<H(X)+2$
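A sketch of the encoder under the assumptions above (fixed symbol order, float arithmetic good enough for small examples; `sfe_code` is an illustrative name):

```python
import math

def sfe_code(p):
    """Shannon-Fano-Elias: truncate the interval midpoint to l(x) bits."""
    codes, F = [], 0.0
    for px in p:
        Fbar = F + px / 2                      # modified CDF (interval midpoint)
        l = math.ceil(math.log2(1 / px)) + 1
        bits, frac = "", Fbar
        for _ in range(l):                     # first l bits of binary expansion
            frac *= 2
            bits += str(int(frac))
            frac -= int(frac)
        codes.append(bits)
        F += px
    return codes

print(sfe_code([0.25, 0.5, 0.125, 0.125]))     # ['001', '10', '1101', '1111']
```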
Random Variable Generation
- fair coin tosses $Z_1,Z_2,\cdots$
- expected number of fair bits $E(T)$
- generate the variable by walking a binary tree whose leaves are labeled with the symbols of $X$
- $E(T)\geq H(X)$
- dyadic distribution: $E(T)=H(X)$
- otherwise: use the binary expansions of the probabilities to build the tree; $H(X)\leq E(T)<H(X)+2$ (sketch of the dyadic case below)
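A sketch for the dyadic case, assuming the probabilities are exact powers of $1/2$ (the `generate` interface is an illustration, not from the notes): build the canonical tree from the leaf depths, then flip fair bits until a leaf is reached.

```python
import math, random

def generate(probs):
    """probs: dict symbol -> dyadic probability summing to 1.
    Returns (number of flips T, symbol); E[T] = H(X) for dyadic probs."""
    # Leaf depth l(s) = log2(1/p(s)); a canonical prefix code gives the tree.
    leaves, val, prev = {}, 0, None
    for s, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        l = round(math.log2(1 / p))
        if prev is not None:
            val = (val + 1) << (l - prev)      # next unused node at depth l
        prev = l
        leaves[format(val, f"0{l}b")] = s
    word = ""
    while word not in leaves:                  # flip fair coins until a leaf
        word += random.choice("01")
    return len(word), leaves[word]

print(generate({"a": 0.5, "b": 0.25, "c": 0.25}))   # e.g. (1, 'a') half the time
```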
Universal Source Coding
- minimax redundancy: $R^*=\min_q\max_{p_\theta}R=\min_q\max_{p_\theta}D(p_\theta\|q)$
- Arithmetic Coding
- LZ77: sliding window
- LZ78: Trie (sketch below)
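A minimal LZ78 parser sketch: each output phrase is a longest previously seen phrase plus one new symbol; a plain dict stands in for the trie.

```python
def lz78_parse(s):
    """Return (index, symbol) pairs; index points to an earlier phrase (0 = empty)."""
    dictionary = {"": 0}                   # phrase -> index; plays the trie's role
    out, phrase = [], ""
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch                   # keep extending the current match
        else:
            out.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                             # input ended inside a known phrase
        out.append((dictionary[phrase], ""))
    return out

print(lz78_parse("ABBABBABBBAABABAA"))
# [(0, 'A'), (0, 'B'), (2, 'A'), (2, 'B'), (1, 'B'), ...]
```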