# Failing Expectations with the Maximum Entropy Principle

This post is on the heels of a paper1 we recently published (with Éva Czabarka) on the degeneracy problem of the Maximum Entropy Method. The post is fairly long as it also provides a tutorial on the topic.

When we are trying to understand a system or predict its behavior, we usually do that based on limited information. In everyday situations we make our decisions and choices using some sort of “inner” probability distribution shaped by past experience. However, human decision-making tends to be biased; just this wiki page enlists over 150 such biases. Naturally, this raises the question whether there is a way to make in-silico, unbiased quantitative inferences based on limited information.

Edwin T. Jaynes, 1922-98
(Washington U. Libraries)

The father of a principled and quantitative approach to unbiased inference making is physicist Edwin T. Jaynes. Starting off his career under the tutelage of none other than Eugene (Jenő) Wigner, he brought significant contributions in several areas of physics, most notably in statistical mechanics and its applications.  Standing on the shoulders of Laplace, Bayes and Shannon, Jaynes started a revolution in statistical inference with his two seminal papers2,3 from 1957.  He based his inference method on the notion of statistical ensembles, much in the same way that statistical mechanics uses the ensembles of microstates to describe macroscopic properties of matter.

In his two celebrated papers Jaynes has re-derived many results from both equilibrium statistical mechanics and the time-dependent density matrix formalism of quantum physics using only the Maximum Entropy Principle. The plot below shows what it means to write papers for the ages: his two papers from 1957 have been accumulating citations exponentially, with citations doubling roughly every ten years.

Citations to Jaynes’s 1957 papers over the years.
Source: Web of Science; Thomson-Reuters.

##### The Maximum Entropy (MaxEnt) Principle

The MaxEnt is based on a simple premise: Given the available information about a system, make predictions using only that information and without introducing additional biases. Tautologically speaking, once the available information has been incorporated, there is no information left for us to exploit. It is as if trying to predict the throw of a fair coin, or the upside of a thrown fair dice. Clearly the uniform choice will be the best choice in this case, but to express this in a general and quantitative way, following Jaynes, one uses the notion of information theoretic entropy introduced by Claude Shannon. Let $\Omega$ denote the space of all possible states of our system of interest. A particular state $\boldsymbol{\mu} \in \Omega$ of the system (or microstate in physics parlance) specifies all the variables necessary to uniquely determine the state. For our purposes it is sufficient to think of $\Omega$ as a discrete collection of states, but in general can be a continuous space as well. Our predictions about the system will be in the form of a probability distribution $P(\boldsymbol{\mu})$ over $\Omega$. One can think of the available information as constraints on the set of microstates that the system can be in: without any information we cannot say anything about the state of the system and thus any state is fair game with the same probability, i.e., $P(\boldsymbol{\mu})=|\Omega|^{-1}$ and thus the uncertainty is maximum. As information is added, it changes the distribution to something else; if we happen to know the precise state of the system $\boldsymbol{\mu}_0$, then $P(\boldsymbol{\mu})=1$ for $\boldsymbol{\mu}=\boldsymbol{\mu}_0$ and $P(\boldsymbol{\mu})=0$, for $\boldsymbol{\mu}\neq\boldsymbol{\mu}_0$, and in this case there is no uncertainty.

Shannon defines information as the amount of surprise we get when some event $\boldsymbol{\mu}$ actually happens, whose probability of happening is otherwise $P(\boldsymbol{\mu})$. Without going into details, the information content in the event $\boldsymbol{\mu}$ happening is defined as $I(\boldsymbol{\mu})=-\log{P(\boldsymbol{\mu})}$. If $\boldsymbol{\mu}$ is the event of snow falling in San Diego then when that actually happens is very surprising (large $I$) and even CNN will talk about it, because $P(\boldsymbol{\mu})$ is very small (last happened in 1967). The same event happening at the North Pole, however,  carries no surprise ($P(\boldsymbol{\mu})$ close to unity, small $I$). As a functional of the distribution $P$, Shannon’s entropy $$S[P] = \langle I(\boldsymbol{\mu})\rangle = \; – \sum_{\boldsymbol{\mu} \in \Omega} P(\boldsymbol{\mu}) \log {P(\boldsymbol{\mu})} \label{entropy}$$ expresses the average uncertainty induced by the distribution $P$. Influenced by his statistical physics background, Jaynes incorporates the available information in form of ensemble averages of observables $m_i(\boldsymbol{\mu})$, $i=1,\ldots,K$ (here $K$ is the number of constraints): $$\overline{m}_i= \sum_{\boldsymbol{\mu} \in \Omega} m_i(\boldsymbol{\mu}) P(\boldsymbol{\mu})\,,\;\;\;i=1,\ldots,K\,. \label{mi}$$ The $\overline{m}_i$ represent values that we know about the system (e.g., we measured them, someone gave them to us, or we simply assumed them). This form in expectation of the constraints is rather natural in statistical physics, where information is in the form of experimentally accessible macroscopic variables (e.g., pressure, temperature, magnetization, etc.), which are averages of microscopic ones. Thus, we are looking to find that distribution $P(\boldsymbol{\mu})$ which obeys the constraints \eqref{mi}, it is normalized $\sum_{\boldsymbol{\mu}\in \Omega} P(\boldsymbol{\mu}) = 1$ and beyond that it is the least biased, or maximally uncertain, i.e., for which entropy \eqref{entropy} is maximal. As Jaynes puts it, this is the distribution that is the least committal towards the unavailable information. Using the standard method of Lagrange multipliers we then maximize \eqref{entropy} subject to \eqref{mi} and normalization and obtain the parametric family of Gibbs distributions: $$P(\boldsymbol{\mu};\boldsymbol{\beta}) =\frac{1}{Z(\boldsymbol{\beta})} e^{-\boldsymbol{\beta}\cdot\boldsymbol{m}(\boldsymbol{\mu})} \label{gibbs}$$ where $$Z(\boldsymbol{\beta}) = \sum_{\boldsymbol{\mu}\in \Omega} e^{-\boldsymbol{\beta}\cdot\boldsymbol{m}(\boldsymbol{\mu})} \label{pf}$$ is the normalization factor, also called partition function. The $\boldsymbol{\beta} = (\beta_1,\ldots,\beta_K)$ Lagrange multipliers are then determined from solving \eqref{mi} with the form of \eqref{gibbs} replacing the $P(\boldsymbol{\mu})$ to obtain unique values $\boldsymbol{\beta} = \boldsymbol{\beta}(\overline{m}_1,\ldots,\overline{m}_K)$. This step is also called fitting because these are usually nasty equations that can only be solved numerically.

##### Throwing 3-sided dice

As a simple example, consider a 3-sided dice. (Take a standard dice and repeat every number 1,2,3 twice on its opposite sides, or use an artistic dice like the one by Nvenom8 shown in the figure.) Thus $\Omega = \{1,2,3\}$. Assume that after throwing the dice very many times Alice reports to Bob the average value of the upside face as $\overline{m} = 2.5$. Based on this information only, Bob wants to compute the MaxEnt probabilities $P(1)$, $P(2)$, $P(3)$ of the topside face. In this case Eq. \eqref{mi} is  $P(1) + 2 P(2) + 3 P(3) = \overline{m}$ and  with normalization $P(1) + P(2) + P(3) = 1$ we have one more unknowns than equations. The MaxEnt distribution will simply be $P(m; \beta) = e^{- m\beta}/Z(\beta)$, $m=1,2,3$ and $Z(\beta) = e^{-\beta} + e^{-2\beta} + e^{-3 \beta}$. Eq. \eqref{mi} becomes a quadratic equation for $e^{-\beta}$ with the only valid solution of: $$e^{-\beta} = \frac{\overline{m}-2+\sqrt{\Delta}}{2(3-\overline{m})}\,,\;\;\Delta = -3\left(\overline{m}\right)^2+12\overline{m}-8$$ leading to $P(1) = \frac{1}{6}(10-3\overline{m}-\sqrt{\Delta})$, $P(2) = \frac{1}{3}(\sqrt{\Delta}-1)$ and $P(3) = \frac{1}{6}(3\overline{m}-2-\sqrt{\Delta})$. In particular, for $\overline{m} = 2.5$ we get $P(1) \simeq 0.116$, $P(2) \simeq 0.267$ and $P(3) \simeq 0.617$, indicating the most likely biases in the dice. The plots of these three probabilities as function of $\overline{m}$ are shown below.

MaxEnt probabilities for the three-sided dice.

What if Alice reports $\overline{m} = 2$? In this case $P(1)=P(2)=P(3) = 1/3$, which are the same values as for an unbiased dice. Does this mean that the dice is fair? If the only information we have is the average topside face value, we cannot say whether the dice is biased, only that the available information does not contradict the assumption that it is a fair dice. To see this, assume that Alice reports in addition, also the average of the square of the upside face value, i.e., $\overline{m^2}$. In this case we have full information, namely, as many unknowns as independent equations and we find that $P(1) = \frac{1}{6}(6-5\overline{m}+\overline{m^2})$, $P(2) = 4 \overline{m} – \overline{m^2} – 3$ and $P(3) =\frac{1}{2}(2-3\overline{m}+\overline{m^2})$. Thus, if Alice reports that $\overline{m} = 2$ (same as it would be for a fair dice) but $\overline{m^2} = 5$ then $P(1)=P(2) = 1/2$ and $P(3) = 0$, and we actually have a heavily biased dice.

For a detailed account on the MaxEnt method, its dynamical version MaxCal and for their many applications see the nice review4 by Pressé et al.

##### The degeneracy problem

First, a disclaimer: All social networks in the following description are purely fictitious, and any resemblance to a network of specific living, dead or undead people would only be possible by a significant stretch of the imagination. Intelligence agencies have discovered that a new extremist but nefarious and intolerant ideology is secretly taking hold across the globe. From indirect evidence and intercepted, but heavily encrypted communications it was found that there are $N=9$ cells involved in nefarious activities all over the world under the aegis of this new ideology. Based on their data, the defense analysts expect that $\overline{M}=17$ pairs of cells interact regularly and that $\overline{T} = 19$ different triplets of cells have collaborated at various times to realize their attacks. However, it is not known specifically which cells are interacting with one another and which cells participated in which activity at any time. The analysts want to know the likelihood that the 9 cells form a single connected network and they also want to find the most likely network that the available information $\left(N,\overline{M},\overline{T}\right)$ is consistent with.

This problem looks ripe for the MaxEnt method. In this case the set of microstates $\Omega$ is formed by all the simple, labeled graphs $G \in \Omega$ (here $G$ replaces $\boldsymbol{\mu}$ in the above) on $N=9$ nodes. There are $|\Omega| = 2^{N(N-1)/2} = 68\,719\,476\,736$ such graphs.

The two constraints in expectation are the number of edges $\overline{M}=17$ and the number of triangles $\overline{T} = 19$. Since the number of nodes is small, in this case we can exactly enumerate (in-silico) all the simple, labeled graphs (graphs from here on) with a given number of edges and triangles and thus compute exactly all the probabilities involved in Jaynes’s method. If $\beta$ is the multiplier corresponding to the number of edges and $\gamma$ for the number of triangles, after solving \eqref{mi} for $\beta, \gamma$, one finds $\beta = 1.3017$ and $\gamma = -0.6325$ and the rather peculiar distribution shown below.

Distribution of graphs with given number of edges in the MaxEnt model for the nefarious network of cells.

It seems as if the MaxEnt method is confused and cannot make up its mind: the microstates (graphs) with the highest probabilities (two peaks in the figure) are either very sparse (few edges) or very dense (almost all connections are there). The average number of edges is certainly respected from the combination of the two types in the sum \eqref{mi}, but the individual realizations/microstates are rarely near the expected value. Moreover, in one case (sparse) the network is disconnected into pieces, whereas in the other (dense) forms a single component, and both are highly probable. But the situation is even worse: since in this case we can enumerate everything, we find that there are $64\,260$ (labeled) graphs with exactly 17 edges and 19 triangles but none of them is fully connected. Yet, the MaxEnt method tells us that there are a significant number of connected graphs! More precisely, the probability that a graph sampled by the MaxEnt method will be fully connected is 0.6. None of the connected graphs (of the 60%), however, has 17 edges and 19 triangles, so they don’t have the observed property at all. This answer is completely useless for our analysts.   This effect is called degeneracy, and has been known for nearly three decades, since the works of David Strauss5 in social networks. Other examples of degeneracy are shown in our paper, one involving the now famous karate-club dataset by constraining the number of edges and two-stars, another involving a network of jazz musicians, where we constrain three quantities in expectation: edges, two-stars and triangles and a third example involving groups of permutations (related, e.g., to predicting outcomes of horse races).

Note that the constraining values $\overline{M},\overline{T}$ could have come from an actual realization/state of the system from $\Omega$. A degenerate distribution in this case may even predict a near zero probability for the realization from which the data was obtained!

Historically, in degenerate cases people simply have thrown away the existing data and tried choosing other types of measures and data for it. With some luck, degeneracy would be avoided. This is certainly unsatisfactory, as we might not be able to collect new data on other types of measures (as in the case of the nefarious network of cells above), or the measures used are of particular interest in the field (like triangles in sociology). We thus have to have a method that maximally exploits the data that is available to us, instead of avoiding them.

##### So what causes degeneracy?

To answer that, let us first make the observation from \eqref{gibbs} and \eqref{pf} that all microstates (graphs in our case) that share the same values for the measures $\boldsymbol{m}$ will all have the same MaxEnt probability. So instead of summing over the space of microstates (which is rather uncanny because it is huge and usually we know little about it), we can gather all the terms (microstates) in the sum with the same $\boldsymbol{m}$ into one group and then sum over the range of values for $\boldsymbol{m}$. In particular, the partition function \eqref{pf} becomes: $$Z(\boldsymbol{\beta}) = \sum_{\boldsymbol{m}} {\cal N}(\boldsymbol{m})\, e^{-\boldsymbol{\beta}\cdot\boldsymbol{m}} \label{pf1}$$ where ${\cal N}(\boldsymbol{m})$ is the number of microstates with given values for the measures $\boldsymbol{m}$. In physics this function is called the density of states (d.o.s) function. Thus, the averages \eqref{mi} can be expressed as $$\overline{m}_i = \sum_{m_i = – \infty}^{\infty} m_i\,p(\boldsymbol{m};\boldsymbol{\beta})\,\;\;\;i=1,\ldots,K\,, \label{mib}$$ where $$p(\boldsymbol{m};\boldsymbol{\beta})=\frac{{\cal N}(\boldsymbol{m})}{Z(\boldsymbol{\beta})}\,e^{-\boldsymbol{\beta}\cdot\boldsymbol{m}}\,. \label{pmb}$$ Note that in all the above the $\boldsymbol{m}$ appear as free variables. Eq. \eqref{pmb} expresses the probability that the MaxEnt distribution generates (or samples) a microstate with given values for measures $\boldsymbol{m}$. Thus, in degenerate cases $p(\boldsymbol{m};\boldsymbol{\beta})$ shows two or more peaks as function of $\boldsymbol{m}$ (see our paper for a more precise description). Since the exponential in \eqref{pmb} is a monotonic function of $\boldsymbol{m}$ it then follows that the culprit behind degeneracy is the d.o.s function ${\cal N}(\boldsymbol{m})$. In the paper we prove the following theorem:

Theorem: A MaxEnt model is non-degenerate if and only if ${\cal N}(\boldsymbol{m})$ is log-concave.

A function is log-concave if the log of the function, that is $\log\!\left({\cal N}(\boldsymbol{m})\right)$ is concave. The fact that we can pinpoint the cause of degeneracy is the good news. The bad news is that it is related to properties of the function ${\cal N}(\boldsymbol{m})$. It is a counting function, giving the number of objects in some space that all share the same property, for example in our case the number of simple graphs on $N$ nodes with a given number of edges and a given number of triangles (precisely that many, not on average!). This is a purely mathematical/combinatorial object and has nothing to do with MaxEnt, probabilities or inference, however, its properties determine whether the MaxEnt method will show degeneracy. The bad news is that counting problems are among the hardest problems in discrete mathematics. Even those problems whose decision version is easy (they are in P), their counting version can be very hard (from #P-complete). For example, deciding that there is perfect matching in a bipartite graph is easy (has a polynomial-time algorithm) but counting their number is very hard (no such algorithm).

The d.o.s. function on 9 nodes (color intensity) with given number of edges and triangles. The cross hair is at 17 edges and 19 triangles.

For a function ${\cal N}(\boldsymbol{m})$ to be log-concave, two conditions need to be fulfilled. On one hand the domain ${\cal D}$ of ${\cal N}(\boldsymbol{m})$ has to be geometrically convex and on the other, it has to fulfill the Prékopa-Leindler inequality (which is just the regular definition for concavity for the logarithm of a function). In many cases the first condition is broken, such as in our example of the nefarious network with ${\cal N}(\boldsymbol{m}) = {\cal N}(m_|,m_{\vartriangle})$ illustrated in the Figure to the right. Since the domain is curved, waxing-crescent shaped, the domain is non-convex (a straight segment with end-points close to the tips will have most of its points outside the domain). The next figure then shows in this 2D map the bimodal distribution of graphs $p(\boldsymbol{m};\boldsymbol{\beta})$ as sampled by the MaxEnt method.

The MaxEnt model samples the graphs according to a bimodal distribution (peaks are at darker blue regions).

The non-convexity of ${\cal D}$ comes from the fact that the variables, i.e., the components of  $\boldsymbol{m}$ are not independent, more precisely, they are non-linearly dependent from one another. In classical statistical mechanics the constraining variables (in expectation) are independent quantities characterizing different types of interactions with the environment such as pressure (mechanical interactions), chemical potential (chemical or particle exchange), temperature (thermal), magnetic susceptibility (magnetic), etc. In this case the domain of the d.o.s. function is always convex (a hyperrectangle) and degeneracy does not occur. Note that if the variables are linearly dependent, the domain is still convex and degeneracy does not occur in this case either (assuming the concavity condition for the log of the d.o.s. is valid).

##### Can we eliminate degeneracy?

Assume that we don’t have the possibility to switch to a different set of variables, but instead we have to work with the available measures and the data for them. Still using the same data, and without losing information we would like to make good predictions based on the MaxEnt approach. Let us call our original (degenerate) model (or ensemble) the MaxEnt($\boldsymbol{m}$) model (ensemble). The idea is to introduce a one-to-one mapping (so no information is lost) from the original variables (and the data for them) to new ones $\boldsymbol{m} \leftrightarrow \boldsymbol{\xi} = \boldsymbol{F}(\boldsymbol{m})$ such that the corresponding d.o.s. function (number of microstates with given $\boldsymbol{\xi}$), ${\cal N}’\left(\boldsymbol{\xi}\right)$ is now log-concave and thus the new model (ensemble) MaxEnt($\boldsymbol{\xi}$) is non-degenerate. Since $\boldsymbol{F}$ is bijective, the number of states with $\boldsymbol{\xi}$ is the same as the number of states with $\boldsymbol{m} = \boldsymbol{F}^{-1}\left( \boldsymbol{\xi}\right)$, i.e., ${\cal N}’\left(\boldsymbol{\xi}\right) = {\cal N}\left(\boldsymbol{F}^{-1}\left( \boldsymbol{\xi}\right)\right)$. The observation behind our idea is that while a function could be concave (or convex), its composition with another function, in the new variables can have altered convexity properties. Note that such mappings do not require the collection of new data! One simply applies the transformations on the data we have.

While we do not have a general method for finding such transformations, in concrete situations they can be done relatively easily. A simple example of degeneracy in our paper is the case when the constraining variables are the number of edges $m_{|}$, and the number of wedges (2-stars) $m_{\vee}$. Since on average the latter is proportional to the square of the former, we may choose $\xi_{|} = {m_{|}}^{2}$ and $\xi_{\vee} = m_{\vee}\$(the choice is not unique, see our paper). In our nefarious network example, we may choose $\xi_{|} = {m_{|}}^{3}$ and $\xi_{\vartriangle} = m_{\vartriangle}$. In all these cases the new model $\text{MaxEnt}(\boldsymbol{\xi})$ will be non-degenerate and based on the same input information with ${\overline{\xi}}_{i} = F_{i}\left( {\overline{m}}_{1},\dots,{\overline{m}}_{K} \right)$, $i = 1,\dots,K$ (chosen that way). Since $\text{MaxEnt}(\boldsymbol{\xi})$ is a new model, we may ask how far are the averages of $m_{i}$ within the new $\text{MaxEnt}(\boldsymbol{\xi})$ ensemble from the original data values ${\overline{m}}_{i}$. Since, for example, instead of requiring that the number of edges on average is 10, we now require that the square of the number of edges is 100 on average—we do not expect this to cause differences that are too big. Indeed, we can show that due to the fact that the new ensemble $\text{MaxEnt}(\boldsymbol{\xi})\$ is concentrated around $\overline{\boldsymbol{\xi}}$ , the differences are actually rather small (see the last paragraph on pg 3 of our published paper1), and the new model still respects \eqref{mi} with good approximation. The figure below shows the corrected MaxEnt distribution by edge-number (black) for the case of the nefarious network.

In the corrected model, $\text{MaxEnt}(\boldsymbol{\xi})$, typical samples are drawn from near the original data values.

Indeed, the samples are now coming with high probability from around the observation value of 17. The red shows the probability that a MaxEnt sampled graph with a given edge number will be disconnected and the green that it will be connected; now the MaxEnt predicts that 94% of sampled graphs will be disconnected, much better than previously. The next figure shows the MaxEnt distribution in 2D, in the corrected $\text{MaxEnt}(\boldsymbol{\xi})$ model; it is concentrated around the observation values. The plot below is in terms of the original variables; in terms of the new ones it would have a convex shape, see our paper.

The corrected $\text{MaxEnt}(\boldsymbol{\xi})$ distribution as function of the complete set of variables.

##### Discussion

We have shown that the root cause of degeneracy in MaxEnt models is the existence of non-linear constraining relationships between the variables. One can eliminate degeneracy by using a one-to-one nonlinear mapping to a set of new variables (using the same data), and defining a MaxEnt model based on these new variables. The new MaxEnt model will be generating samples with high probability from the neighborhood of microstates for which the given constraints are typical, and the mean values of the original variables in this new model will be close to the original input data.

A few comments on MaxEnt models are in order, notwithstanding degeneracy issues.

A. The input data may be applied within a frequentist or a Bayesian framework. The data may come from several repeat observations and the averages represent actual averages of those observations (frequentist). Or, based on our best knowledge we may believe that we can interpret the single dataset available to us as representing typical values and thus we choose to treat them as expected values (Bayesian). In the latter case we will actually generate a MaxEnt ensemble within which the input values are indeed typical. However, if the data came from an atypical or special state of the system, then our predictions will be off when compared against new empirical data for those predictions. When that happens we know that our view of the system is off and thus we can use the MaxEnt as a tool to help update our understanding of a complex system — more on this in the next point.

B. Given a set of constraining measures $\boldsymbol{m}$ we sometimes want to predict the value (in expectation) of another measure (such as the number of triangles knowing the number of edges and 2-stars), or even its distribution. This can only be done to the extent that the constraining measures determine the variable of interest. However, if the variable of interest is independent from the constraining ones, the MaxEnt model will be useless for predicting that variable, because the information we have says nothing about what we are interested in. Jaynes makes the observation that this aspect is actually a strength of the MaxEnt method: if our predictions are way off for a quantity of interest when compared against empirical data, it means that we have not included the relevant variables into our model for that quantity. Unfortunately, this does not say what are the relevant variables, just that the ones used are irrelevant.

C. The output of the MaxEnt model is the distribution \eqref{gibbs} and we have to sample microstates from $\Omega$ according to this distribution. This, in itself, is a thorny issue and one typically does it using a Markov Chain Monte Carlo algorithm. As a matter of fact this step is a computational bottleneck that limits MaxEnt applications to relatively small sets of data.

D. One of the original criticisms of Jaynes’s method was its subjective (Bayesian) nature. Namely, physical reality does not depend on our state of knowledge and thus it brought into question the method that maximizing the information entropy (which expresses the uncertainty in *our* knowledge about a system) can be used to generate predictions about physical reality. However, since then several authors, starting with Shore and Johnson have demonstrated that any algorithm that draws inferences about probability distributions in a self-consistent and unbiased way will be maximizing the entropy function \eqref{entropy}—see the review4 by Pressé et al.

##### Open problems

1. Is there a general method/algorithm that computes/generates the mapping $\boldsymbol{F}$ when $\mathcal{D}$ is simply connected such that in the new variables the counting function is log-concave? In 2D, we know from the Riemann mapping theorem that there is a conformal mapping that takes any simply connected domain into the unit disk in the complex plane. What about higher dimensions? Can we define $\boldsymbol{F}$ as a computable flow that maps a non-convex $\mathcal{D}$ into a convex $\mathcal{D’}$?

2. Speeding up the sampling (MCMC) process of the microstates according to the MaxEnt distribution \eqref{gibbs}.

3. Another computational difficulty is solving for the Lagrangean multipliers in \eqref{gibbs} (or \eqref{mib}), because in general we do not know explicitly the counting function. This is usually done approximately, with various stochastic root-finding methods. Improvements would be dearly needed here.

##### References

1. Sz. Horvát, É. Czabarka and Z. Toroczkai, Phys. Rev. Lett. 114, 158701 (2015). arXiv:1407.0991.
2. E. T. Jaynes, Phys. Rev. 106, 620 (1957).
3. E. T. Jaynes, Phys. Rev. 108, 171 (1957).
4. S. Pressé, K. Ghosh, J. Lee and K. A. Dill, Rev. Mod. Phys. 85, 1115 (2013)
5. D. Strauss, SIAM Rev. 28(4), 513 (1986)

This work was supported in part from Grant No. FA9550-12-1-0405 from the U.S. Air Force Office of Scientific Research (AFOSR) and the Defense Advanced Research Projects Agency (DARPA) and, additionally, in part by the Defense Threat Reduction Agency (DTRA) under Grant No. HDTRA
1-09-1-0039.