The minimum description length (MDL) principle is a formalization of "Occam's razor in which the best "hypothesis (a model and its parameters) for a given set of data is the one that leads to the best "compression of the data. MDL was introduced by "Jorma Rissanen in 1978. It is an important concept in "information theory and "computational learning theory.
There are various statistical methods for learning the parameters of a model using some data. Usually they assume that the general form of the model is fixed. MDL is a principle for selecting the general form of a model and its parameters. The model and a particular set of parameters is called a hypothesis. The philosophy behind MDL is that all statistical learning is about finding regularities in data. The best hypothesis to describe the regularities in data is also the one that is able to compresses the data most.
[The MDL Principle] is based on the following insight: any regularity in a given set of data can be used to "compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally. (Grünwald, 1998)
To select the hypothesis that captures the most regularity in the data, scientists look for the hypothesis with which the best compression can be achieved. In order to do this, a "code is fixed to compress the data, most generally with a ("Turing-complete) "computer "language. A "program to output the data is written in that language; thus the program effectively represents the data. The length of the shortest program that outputs the data is called the "Kolmogorov complexity of the data. This is the central idea of "Ray Solomonoff's "idealized theory of inductive inference.
However, this mathematical theory does not provide a practical way of reaching an inference. The most important reasons for this are:
MDL attempts to remedy these by:
Rather than "programs", in MDL theory one usually speaks of candidate hypotheses, models or codes. The set of allowed codes is then called the model class (some authors refer to the model class as the model). The code is then selected for which the sum of the description of the code and the description of the data using the code is minimal.
One of the important properties of MDL methods is that they provide a natural safeguard against "overfitting, because they implement a tradeoff between the complexity of the hypothesis (model class) and the complexity of the data given the hypothesis. An illustration is given in the following example.
This section has multiple issues. Please help improve it or discuss these issues on the "talk page. ("Learn how and when to remove these template messages)("Learn how and when to remove this template message)
A coin is flipped 1000 times, and the numbers of heads and tails are recorded. Consider two model classes:
For this reason a naive statistical method might choose the second model as a better explanation for the data. However, an MDL approach would construct a single code based on the hypothesis, instead of just using the best one. To do this, it is simplest to use a two-part code in which the element of the model class with the best performance is specified. Then the data is specified using that code. A lot of bits are needed to specify which code to use, thus the total codelength based on the second model class could be larger than 1000 bits. Therefore, the conclusion when following an MDL approach is inevitably that there is not enough evidence to support the hypothesis of the biased coin, even though the best element of the second model class provides better fit to the data.
Central to MDL theory is the "one-to-one correspondence between code length "functions and "probability distributions (this follows from the "Kraft–McMillan inequality). For any probability distribution , it is possible to construct a code such that the length (in bits) of is equal to ; this code minimizes the expected code length. Conversely, given a code , one can construct a probability distribution such that the same holds. (Rounding issues are ignored here.) In other words, searching for an efficient code is equivalent to searching for a good probability distribution.
MDL is very strongly connected to "probability theory and "statistics through the correspondence between codes and probability distributions mentioned above. This has led some researchers to view MDL as equivalent to "Bayesian inference: code length of model and data together in MDL correspond respectively to "prior probability and "marginal likelihood in the Bayesian framework.
While Bayesian machinery is often useful in constructing efficient MDL codes, the MDL framework also accommodates other codes that are not Bayesian. An example is the Shtarkov normalized maximum likelihood code, which plays a central role in current MDL theory, but has no equivalent in Bayesian inference. Furthermore, Rissanen stresses that we should make no assumptions about the true "data-generating process: in practice, a model class is typically a simplification of reality and thus does not contain any code or probability distribution that is true in any objective sense. In the last mentioned reference Rissanen bases the mathematical underpinning of MDL on the "Kolmogorov structure function.
According to the MDL philosophy, Bayesian methods should be dismissed if they are based on unsafe "priors that would lead to poor results. The priors that are acceptable from an MDL point of view also tend to be favored in so-called "objective Bayesian analysis; there, however, the motivation is usually different.
MDL was not the first "information-theoretic approach to learning; as early as 1968 Wallace and Boulton pioneered a related concept called "minimum message length (MML). The difference between MDL and MML is a source of ongoing confusion. Superficially, the methods appear mostly equivalent, but there are some significant differences, especially in interpretation: