Prefix code

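A prefix code is a type of code system distinguished by its possession of the prefix property, which requires that no whole code word in the system be a prefix (initial segment) of any other code word in the system. The property holds trivially for fixed-length codes, so it is only a point of consideration for variable-length codes.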

For example, a code with code words {9, 55} has the prefix property; a code consisting of {9, 5, 59, 55} does not, because "5" is a prefix of "59" and also of "55". A prefix code is a uniquely decodable code: given a complete and accurate sequence, a receiver can identify each word without requiring a special marker between words. However, there are uniquely decodable codes that are not prefix codes; for instance, the reverse of a prefix code is still uniquely decodable (it is a suffix code), but it is not necessarily a prefix code.
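The prefix property can be checked mechanically. The following is a minimal sketch, assuming code words are given as Python strings; the function name `is_prefix_free` is chosen here for illustration and does not come from any library.

```python
def is_prefix_free(code_words):
    """Return True if no code word is a prefix of another code word."""
    words = sorted(code_words)
    # After lexicographic sorting, a word that is a prefix of some other
    # word is immediately followed by a word that starts with it, so it
    # is enough to compare neighbours.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))


print(is_prefix_free(["9", "55"]))             # True
print(is_prefix_free(["9", "5", "59", "55"]))  # False: "5" is a prefix of "55"
```

Sorting first keeps the check at one comparison per neighbouring pair instead of comparing every pair of code words.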

Prefix codes are also known as prefix-free codes, prefix condition codes and instantaneous codes. Although Huffman coding is just one of many algorithms for deriving prefix codes, prefix codes are also widely referred to as "Huffman codes", even when the code was not produced by a Huffman algorithm. The term comma-free code is sometimes also applied as a synonym for prefix-free codes^[1]^[2] but in most mathematical books and articles (e.g.^[3]^[4]) a comma-free code is used to mean a self-synchronizing code, a subclass of prefix codes.

Using prefix codes, a message can be transmitted as a sequence of concatenated code words, without any out-of-band markers or special markers between words to frame the words in the message. The recipient can decode the message unambiguously, by repeatedly finding and removing sequences that form valid code words. This is not generally possible with codes that lack the prefix property, for example {0, 1, 10, 11}: a receiver reading a "1" at the start of a code word would not know whether that was the complete code word "1", or merely the prefix of the code word "10" or "11"; so the string "10" could be interpreted either as a single code word or as the concatenation of the words "1" then "0".
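The decoding procedure described above can be sketched as follows. This is an illustrative implementation, assuming the message and the code words are Python strings; the name `decode` is hypothetical.

```python
def decode(message, code_words):
    """Split a concatenated message into the code words of a prefix code."""
    words = set(code_words)
    decoded, current = [], ""
    for symbol in message:
        current += symbol
        if current in words:        # a complete code word has been read
            decoded.append(current)
            current = ""
    if current:
        raise ValueError("message ends in the middle of a code word")
    return decoded


print(decode("955559", ["9", "55"]))  # ['9', '55', '55', '9']
```

Because no code word is a prefix of another, the first match is always the right one. Run on the non-prefix code {0, 1, 10, 11}, the same loop would always split "10" into "1" and "0" and could never report the code word "10", which is the ambiguity the paragraph describes.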

The variable-length instruction sets (machine language) of most computer microarchitectures are prefix codes.

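Prefix codes are not error-correcting codes. In practice, a message might first be compressed with a prefix code and then encoded again with channel coding (including error correction) before transmission.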

For any uniquely decodable code there is a prefix code that has the same code word lengths.^[5] Kraft's inequality characterizes the sets of code word lengths that are possible in a uniquely decodable code.^[6]
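For a code over an alphabet of r symbols with code word lengths l_1, ..., l_n, Kraft's inequality states that the sum of r^(-l_i) is at most 1. The sketch below evaluates that sum exactly and, for the binary case, builds one prefix code with the requested lengths. The function names are illustrative, and the construction shown (assigning consecutive binary values in order of increasing length) is one standard choice among several.

```python
from fractions import Fraction


def kraft_sum(lengths, alphabet_size=2):
    """Sum of r**(-l) over the code word lengths l, as an exact fraction."""
    return sum(Fraction(1, alphabet_size ** l) for l in lengths)


def binary_prefix_code(lengths):
    """Build binary code words with the given positive lengths."""
    if kraft_sum(lengths) > 1:
        raise ValueError("lengths violate Kraft's inequality")
    words, value, previous = [], 0, 0
    for length in sorted(lengths):
        value <<= length - previous   # extend the counter to the new length
        words.append(format(value, f"0{length}b"))
        value += 1                    # next unused code word of this length
        previous = length
    return words


print(kraft_sum([1, 2, 3, 3]))           # 1
print(binary_prefix_code([1, 2, 3, 3]))  # ['0', '10', '110', '111']
print(kraft_sum([1, 1, 2]))              # 5/4, so no such binary code exists
```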

Techniques

If every word in the code has the same length, the code is called a fixed-length code, or a block code (though the term block code is also used for fixed-size error-correcting codes in channel coding). For example, UTF-32/UCS-4 letters are always 32 bits long. ATM cells are always 424 bits (53 bytes) long. A fixed-length code of fixed length $k$ bits can encode up to $2^{k}$ source symbols.

A fixed-length code is necessarily a prefix code. It is possible to turn any code into a fixed-length code by padding the shorter prefixes with fixed symbols until they meet the length of the longest prefixes. Alternatively, such padding codes may be employed to introduce redundancy that allows autocorrection and/or synchronisation. However, fixed-length encodings are inefficient in situations where some words are much more likely to be transmitted than others.

Truncated binary encoding is a straightforward generalization of fixed-length codes to deal with cases where the number of symbols $n$ is not a power of two. Source symbols are assigned codewords of length $k$ and $k+1$, where $k$ is chosen so that $2^{k} < n \leq 2^{k+1}$.
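A minimal sketch of truncated binary encoding follows, assuming the source symbols are numbered 0 to n-1 and taking k as the floor of log2(n), which agrees with the definition above whenever n is not a power of two. The function name is hypothetical.

```python
def truncated_binary(x, n):
    """Encode symbol number x (0 <= x < n) of an n-symbol alphabet."""
    if not 0 <= x < n:
        raise ValueError("x must satisfy 0 <= x < n")
    k = n.bit_length() - 1          # floor(log2(n))
    u = (1 << (k + 1)) - n          # how many symbols get the short k-bit words
    if x < u:
        return format(x, f"0{k}b") if k > 0 else ""
    return format(x + u, f"0{k + 1}b")


print([truncated_binary(x, 5) for x in range(5)])
# ['00', '01', '10', '110', '111']
```

With n = 5, three symbols receive 2-bit codewords and two receive 3-bit codewords; when n is a power of two, every symbol gets the same length and the scheme reduces to a fixed-length code.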

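Huffman coding is a more sophisticated technique for constructing variable-length prefix codes. The Huffman coding algorithm takes as input the frequencies that the code words should have, and constructs a prefix code that minimizes the weighted average of the code word lengths. This is a form of lossless data compression based on entropy encoding.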

Some codes mark the end of a code word with a special "comma" symbol (also called a sentinel value), different from normal data.^[7] This is somewhat analogous to the spaces between words in a sentence; they mark where one word ends and another begins. If every code word ends in a comma, and the comma does not appear elsewhere in a code word, the code is automatically prefix-free. However, reserving an entire symbol only for use as a comma can be inefficient, especially for languages with a small number of symbols. Morse code is an everyday example of a variable-length code with a comma. The long pauses between letters, and the even longer pauses between words, help people recognize where one letter (or word) ends and the next begins. Similarly, Fibonacci coding uses a "11" to mark the end of every code word.
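To illustrate how "11" acts as a terminator, here is a sketch of Fibonacci coding for positive integers. It assumes the usual convention of writing the integer as a sum of non-consecutive Fibonacci numbers (1, 2, 3, 5, 8, ...), smallest first, and then appending a final "1". The function names are illustrative.

```python
def fibonacci_encode(n):
    """Fibonacci code word for a positive integer n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    fibs.pop()                      # drop the first Fibonacci number above n
    bits = []
    for f in reversed(fibs):        # greedy choice, largest term first
        if f <= n:
            bits.append("1")
            n -= f
        else:
            bits.append("0")
    # Smallest term first, then the extra "1" that creates the "11" marker.
    return "".join(reversed(bits)) + "1"


def split_stream(bits):
    """Cut a concatenation of Fibonacci code words at each "11"."""
    words, current = [], ""
    for b in bits:
        current += b
        if current.endswith("11"):
            words.append(current)
            current = ""
    return words


print([fibonacci_encode(n) for n in (1, 2, 3, 4)])  # ['11', '011', '0011', '1011']
print(split_stream("101111011"))                    # ['1011', '11', '011']
```

The greedy representation never uses two consecutive Fibonacci numbers, so "11" cannot occur inside a code word and only appears at its end.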

Self-synchronizing codes are prefix codes that allow frame synchronization.

Related concepts

A suffix code is a set of words none of which is a suffix of any other; equivalently, a set of words which are the reverse of a prefix code. As with a prefix code, the representation of a string as a concatenation of such words is unique. A bifix code is a set of words which is both a prefix and a suffix code.^[8]

An optimal prefix code is a prefix code with minimal average length. That is, assume an alphabet of $n$ symbols with probabilities $p(A_{i})$, and let $\lambda _{i}$ be the lengths of the codewords of a prefix code $C$. Then $C$ is optimal if, for every other prefix code $C'$ with codeword lengths $\lambda '_{i}$, $\sum _{i=1}^{n}{\lambda _{i}p(A_{i})}\leq \sum _{i=1}^{n}{\lambda '_{i}p(A_{i})}$.^[9]
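The average length in the definition of an optimal prefix code is a probability-weighted sum of codeword lengths. The sketch below evaluates it for a small made-up alphabet; the symbols, probabilities and both codes are assumptions chosen only for illustration.

```python
from fractions import Fraction


def average_length(code, probabilities):
    """Probability-weighted average of the codeword lengths."""
    return sum(probabilities[s] * len(word) for s, word in code.items())


# Hypothetical three-symbol source.
p = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
variable = {"a": "0", "b": "10", "c": "11"}    # shorter word for the likelier symbol
fixed = {"a": "00", "b": "01", "c": "10"}      # fixed-length alternative

print(average_length(variable, p))  # 3/2
print(average_length(fixed, p))     # 2
```

Both codes are prefix codes, but the variable-length one has the smaller average, so the fixed-length code is not optimal for this source.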

Prefix codes in use today

Examples of prefix codes include:

variable-length Huffman codes
country calling codes
Chen–Ho encoding
the country and publisher parts of ISBNs
the Secondary Synchronization Codes used in the W-CDMA 3G Wireless Standard
VCR Plus+ codes
Unicode Transformation Format, in particular the UTF-8 system for encoding Unicode characters, which is both a prefix-free code and a self-synchronizing code^[10]
variable-length quantity
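As an illustration of the UTF-8 entry in the list above, the following sketch encodes a few sample characters and checks that no encoding is a prefix of another. It also shows how the first byte announces the length of the whole code word. The helper `utf8_length_from_lead_byte` is written for this example and only inspects the leading bit pattern; it does not validate the byte sequence.

```python
def utf8_length_from_lead_byte(byte):
    """Number of bytes in a UTF-8 code word, judged from its first byte."""
    if byte < 0x80:
        return 1                    # 0xxxxxxx
    if 0xC0 <= byte < 0xE0:
        return 2                    # 110xxxxx
    if 0xE0 <= byte < 0xF0:
        return 3                    # 1110xxxx
    if 0xF0 <= byte < 0xF8:
        return 4                    # 11110xxx
    raise ValueError("not a valid UTF-8 lead byte")


samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
encoded = [ch.encode("utf-8") for ch in samples]

print([len(e) for e in encoded])                            # [1, 2, 3, 4]
print([utf8_length_from_lead_byte(e[0]) for e in encoded])  # [1, 2, 3, 4]

# No sample encoding is a prefix of a different one.
print(any(a != b and b.startswith(a) for a in encoded for b in encoded))  # False
```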

Techniques

Commonly used techniques for constructing prefix codes include Huffman codes and the earlier Shannon–Fano codes, and universal codes such as:

Elias delta coding
Elias gamma coding
Elias omega coding
Fibonacci coding
Levenshtein coding
Unary coding
Golomb Rice code
Straddling checkerboard (simple cryptography technique which produces prefix codes)
binary coding^[11]
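As an example of one technique from this list, here is a compact sketch of Huffman's construction: repeatedly merge the two least frequent subtrees and prepend one more bit to every code word inside them. The symbol frequencies are made up for illustration, and the exact bit patterns depend on how ties are broken, so only the code word lengths should be read as meaningful.

```python
import heapq
from itertools import count


def huffman_code(frequencies):
    """Map each symbol to a binary code word using Huffman's algorithm."""
    if not frequencies:
        return {}
    if len(frequencies) == 1:
        return {next(iter(frequencies)): "0"}
    tiebreak = count()              # keeps heap entries comparable on equal weights
    heap = [(w, next(tiebreak), {s: ""}) for s, w in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, low = heapq.heappop(heap)    # two least frequent subtrees
        w2, _, high = heapq.heappop(heap)
        merged = {s: "0" + word for s, word in low.items()}
        merged.update({s: "1" + word for s, word in high.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


code = huffman_code({"a": 5, "b": 2, "c": 1, "d": 1})
print({s: len(word) for s, word in sorted(code.items())})
# {'a': 1, 'b': 2, 'c': 3, 'd': 3}
```

The most frequent symbol receives the shortest code word, and the resulting set of code words is prefix-free because each symbol sits at a leaf of the merge tree.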

Notes

[1] US Federal Standard 1037C

[2] ATIS Telecom Glossary 2007, archived from the original on July 8, 2010, retrieved December 4, 2010

[3] Berstel, Jean; Perrin, Dominique (1985), Theory of Codes, Academic Press

[4] S2CID 124092269

[5] Le Boudec, Jean-Yves, Patrick Thiran, and Rüdiger Urbanke. Introduction aux sciences de l'information: entropie, compression, chiffrement et correction d'erreurs. PPUR Presses polytechniques, 2015.

[6] Berstel et al (2010) p.75

[7] A. Jones, J. "Development of Trigger and Control Systems for CMS" (PDF). High Energy Physics, Blackett Laboratory, Imperial College, London. p. 70. Archived from the original (PDF) on Jun 13, 2011.

[8] Berstel et al (2010) p.58

[9] McGill COMP 423 Lecture notes

[10] Pike, Rob (2003-04-03). "UTF-8 history".

[11] doi:10.25209/2079-3316-2018-9-4-239-252

References

Berstel, Jean; Perrin, Dominique; Reutenauer, Christophe (2010). Codes and automata. Encyclopedia of Mathematics and its Applications. Vol. 129. Cambridge. Zbl 1187.94001.

Zbl 0298.94011.

D.A. Huffman, "A method for the construction of minimum-redundancy codes", Proceedings of the I.R.E., Sept. 1952, pp. 1098–1102 (Huffman's original article)

Profile: David A. Huffman, Scientific American, Sept. 1991, pp. 54–58 (Background story)

ISBN 0-262-03293-7. Section 16.3, pp. 385–392.

This article incorporates public domain material from Federal Standard 1037C. General Services Administration. Archived from the original on 2022-01-22.

External links

Codes, trees and the prefix property by Kona Macphee
