Prediction by partial matching
Prediction by partial matching (PPM) is an adaptive statistical data compression technique based on context modeling and prediction. PPM models use a set of previous symbols in the uncompressed symbol stream to predict the next symbol in the stream. PPM algorithms can also be used to cluster data into predicted groupings in cluster analysis.
Theory
Predictions are usually reduced to symbol rankings. Each symbol (a letter, bit or any other amount of data) is ranked before it is compressed, and the ranking system determines the corresponding codeword (and therefore the compression rate). In many compression algorithms, the ranking is equivalent to probability mass function estimation. Given the previous letters (or given a context), each symbol is assigned a probability. For instance, in arithmetic coding the symbols are ranked by their probabilities of appearing after previous symbols, and the whole sequence is compressed into a single fraction that is computed according to these probabilities.
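The link between context probabilities and arithmetic coding can be illustrated with a minimal sketch. The probability table and symbols below are illustrative, not taken from any real PPM implementation; only the interval-narrowing step of arithmetic coding is shown, without the bit-level output stage.

```python
# Hypothetical per-context probability estimates (illustrative values).
probs = {"a": 0.5, "b": 0.3, "c": 0.2}

def encode_interval(sequence, probs):
    """Narrow the interval [low, high) once per symbol, as arithmetic
    coding does: each symbol selects the sub-interval whose width is
    proportional to that symbol's probability."""
    low, high = 0.0, 1.0
    for sym in sequence:
        span = high - low
        cum = 0.0
        for s, p in probs.items():
            if s == sym:
                high = low + span * (cum + p)
                low = low + span * cum
                break
            cum += p
    return low, high

low, high = encode_interval("ab", probs)
# "a" selects [0, 0.5); within it, "b" selects [0.25, 0.4)
```

Any number inside the final interval identifies the whole sequence; more probable sequences yield wider intervals, which need fewer bits to specify.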
The number of previous symbols, n, determines the order of the PPM model which is denoted as PPM(n). Unbounded variants where the context has no length limitations also exist and are denoted as PPM*. If no prediction can be made based on all n context symbols, a prediction is attempted with n − 1 symbols. This process is repeated until a match is found or no more symbols remain in context. At that point a fixed prediction is made.
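The fallback from longer to shorter contexts can be sketched as follows. This is a toy predictor, not a faithful PPM implementation: it keeps a count table per context and backs off to shorter contexts when the current one has seen nothing, but omits escape probabilities and the exclusion heuristics real PPM coders use.

```python
from collections import defaultdict, Counter

class SimplePPM:
    """Toy order-n PPM-style predictor (illustrative only)."""

    def __init__(self, order):
        self.order = order
        self.tables = defaultdict(Counter)  # context string -> symbol counts

    def update(self, history, symbol):
        # Record the symbol under every context length 0..order.
        for n in range(self.order + 1):
            ctx = history[max(0, len(history) - n):]
            self.tables[ctx][symbol] += 1

    def predict(self, history):
        # Try the longest context first, then progressively shorter ones,
        # mirroring the n, n-1, ... fallback described above.
        for n in range(self.order, -1, -1):
            ctx = history[max(0, len(history) - n):]
            counts = self.tables[ctx]
            if counts:
                total = sum(counts.values())
                return {s: c / total for s, c in counts.items()}
        return None  # nothing seen at all: a fixed prediction would be used

model = SimplePPM(order=2)
text = "abracadabra"
for i, ch in enumerate(text):
    model.update(text[:i], ch)

model.predict("abra")  # order-2 context "ra" has only ever preceded "c"
```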
Much of the work in optimizing a PPM model is handling inputs that have not already occurred in the input stream. The obvious way to handle them is to create a "never-seen" symbol which triggers the escape sequence. But what probability should be assigned to a symbol that has never been seen? This is called the zero-frequency problem. One variant uses the Laplace estimator, which assigns the "never-seen" symbol a fixed pseudocount of one. A variant called PPMd increments the pseudocount of the "never-seen" symbol every time it is used; in other words, PPMd estimates the probability of a new symbol as the ratio of the number of unique symbols to the total number of symbols observed.
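One common answer to the zero-frequency problem is the "method C" escape estimate (as used in PPMC); the sketch below shows that one choice, and the `<ESC>` token name is an illustrative placeholder, not a fixed identifier from any implementation.

```python
from collections import Counter

def probabilities_with_escape(counts):
    """PPMC-style estimate (one common choice, not the only one):
    the escape event gets probability u / (n + u), where n is the
    total symbol count in this context and u is the number of
    distinct symbols seen; each seen symbol s gets c_s / (n + u)."""
    n = sum(counts.values())
    u = len(counts)
    probs = {s: c / (n + u) for s, c in counts.items()}
    probs["<ESC>"] = u / (n + u)  # mass reserved for unseen symbols
    return probs

counts = Counter({"a": 3, "b": 1})  # symbols seen in some context
p = probabilities_with_escape(counts)
# n = 4, u = 2: P(a) = 3/6, P(b) = 1/6, P(<ESC>) = 2/6
```

On an escape, the coder drops to the next shorter context and repeats the estimate there, so the probabilities along the fallback chain multiply together.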
Implementation
PPM compression implementations vary greatly in other details. The actual symbol selection is usually recorded using arithmetic coding, though it is also possible to use Huffman encoding or even some type of dictionary coding technique.
Published research on this family of algorithms can be found as far back as the mid-1980s. Software implementations were not popular until the early 1990s because PPM algorithms require a significant amount of RAM. Recent PPM implementations are among the best-performing lossless compression programs for natural language text.
PPMd is a public domain implementation of PPMII (PPM with information inheritance) by Dmitry Shkarin which has undergone several incompatible revisions.
Attempts to improve PPM algorithms led to the PAQ series of data compression algorithms.
Rather than being used for compression, a PPM algorithm can also be used to increase the efficiency of user input, as in the alternate input method program Dasher.
Sources
- Cleary, J.; Witten, I. (April 1984). "Data Compression Using Adaptive Coding and Partial String Matching". IEEE Transactions on Communications. 32 (4): 396–402.
- Moffat, A. (November 1990). "Implementing the PPM data compression scheme". IEEE Transactions on Communications. 38 (11): 1917–1921. doi:10.1109/26.61469.
- Cleary, J. G.; Teahan, W. J.; Witten, I. H. (1997). "Unbounded length contexts for PPM". The Computer Journal. 40 (2_and_3). Oxford, England: Oxford University Press: 67–75. ISSN 0010-4620.
- C. Bloom, Solving the problems of context modeling.
- W. J. Teahan, Probability estimation for PPM. Original source retrieved via archive.org.
- Schürmann, T.; Grassberger, P. (September 1996). "Entropy estimation of symbol sequences". Chaos. 6 (3): 414–427. S2CID 10090433.
References
- ^ "BMF, PPMd Всё о сжатии данных, изображений и видео" [BMF, PPMd. All about compression of data, images and video]. compression.ru (in Russian). NOTE: requires manually setting the "Cyrillic (Windows)" encoding in the browser.
External links
- Suite of PPM compressors with benchmarks
- BICOM, a bijective PPM compressor Archived 2004-04-15 at the Wayback Machine
- "Arithmetic Coding + Statistical Modeling = Data Compression", Part 2
- (in Russian) PPMd compressor by Dmitri Shkarin
- PPM compression in C++ by René Puschinger