The Entropy Hut

Failing Expectations with the Maximum Entropy Principle

Zoltan — Thu, 06 Aug 2015 16:18:29 +0000

This post is on the heels of a paper ¹ we recently published (with Éva Czabarka) on the degeneracy problem of the Maximum Entropy Method. The post is fairly long as it also provides a tutorial on the topic.

When we are trying to understand a system or predict its behavior, we usually do that based on limited information. In everyday situations we make our decisions and choices using some sort of “inner” probability distribution shaped by past experience. However, human decision-making tends to be biased; just this wiki page enlists over 150 such biases. Naturally, this raises the question whether there is a way to make in-silico, unbiased quantitative inferences based on limited information.

Edwin T. Jaynes, 1922-98
(Washington U. Libraries)

The father of a principled and quantitative approach to unbiased inference making is physicist Edwin T. Jaynes. Starting off his career under the tutelage of none other than Eugene (Jenő) Wigner, he brought significant contributions in several areas of physics, most notably in statistical mechanics and its applications. Standing on the shoulders of Laplace, Bayes and Shannon, Jaynes started a revolution in statistical inference with his two seminal papers²^,³ from 1957. He based his inference method on the notion of statistical ensembles, much in the same way that statistical mechanics uses the ensembles of microstates to describe macroscopic properties of matter.

In his two celebrated papers Jaynes has re-derived many results from both equilibrium statistical mechanics and the time-dependent density matrix formalism of quantum physics using only the Maximum Entropy Principle. The plot below shows what it means to write papers for the ages: his two papers from 1957 have been accumulating citations exponentially, with citations doubling roughly every ten years.

Citations to Jaynes’s 1957 papers over the years.
Source: Web of Science; Thomson-Reuters.

The Maximum Entropy (MaxEnt) Principle

The MaxEnt is based on a simple premise: Given the available information about a system, make predictions using only that information and without introducing additional biases. Tautologically speaking, once the available information has been incorporated, there is no information left for us to exploit. It is as if trying to predict the throw of a fair coin, or the upside of a thrown fair dice. Clearly the uniform choice will be the best choice in this case, but to express this in a general and quantitative way, following Jaynes, one uses the notion of information theoretic entropy introduced by Claude Shannon. Let $\Omega$ denote the space of all possible states of our system of interest. A particular state $\boldsymbol{\mu} \in \Omega$ of the system (or microstate in physics parlance) specifies all the variables necessary to uniquely determine the state. For our purposes it is sufficient to think of $\Omega$ as a discrete collection of states, but in general can be a continuous space as well. Our predictions about the system will be in the form of a probability distribution $P(\boldsymbol{\mu})$ over $\Omega$. One can think of the available information as constraints on the set of microstates that the system can be in: without any information we cannot say anything about the state of the system and thus any state is fair game with the same probability, i.e., $P(\boldsymbol{\mu})=|\Omega|^{-1}$ and thus the uncertainty is maximum. As information is added, it changes the distribution to something else; if we happen to know the precise state of the system $\boldsymbol{\mu}_0$, then $P(\boldsymbol{\mu})=1$ for $\boldsymbol{\mu}=\boldsymbol{\mu}_0$ and $P(\boldsymbol{\mu})=0$, for $\boldsymbol{\mu}\neq\boldsymbol{\mu}_0$, and in this case there is no uncertainty.

Shannon defines information as the amount of surprise we get when some event $\boldsymbol{\mu}$ actually happens, whose probability of happening is otherwise $P(\boldsymbol{\mu})$. Without going into details, the information content in the event $\boldsymbol{\mu}$ happening is defined as $I(\boldsymbol{\mu})=-\log{P(\boldsymbol{\mu})}$. If $\boldsymbol{\mu}$ is the event of snow falling in San Diego then when that actually happens is very surprising (large $I$) and even CNN will talk about it, because $P(\boldsymbol{\mu})$ is very small (last happened in 1967). The same event happening at the North Pole, however, carries no surprise ($P(\boldsymbol{\mu})$ close to unity, small $I$). As a functional of the distribution $P$, Shannon’s entropy \begin{equation} S[P] = \langle I(\boldsymbol{\mu})\rangle = \; – \sum_{\boldsymbol{\mu} \in \Omega} P(\boldsymbol{\mu}) \log {P(\boldsymbol{\mu})} \label{entropy} \end{equation} expresses the average uncertainty induced by the distribution $P$. Influenced by his statistical physics background, Jaynes incorporates the available information in form of ensemble averages of observables $m_i(\boldsymbol{\mu})$, $i=1,\ldots,K$ (here $K$ is the number of constraints): \begin{equation} \overline{m}_i= \sum_{\boldsymbol{\mu} \in \Omega} m_i(\boldsymbol{\mu}) P(\boldsymbol{\mu})\,,\;\;\;i=1,\ldots,K\,. \label{mi} \end{equation} The $\overline{m}_i$ represent values that we know about the system (e.g., we measured them, someone gave them to us, or we simply assumed them). This form in expectation of the constraints is rather natural in statistical physics, where information is in the form of experimentally accessible macroscopic variables (e.g., pressure, temperature, magnetization, etc.), which are averages of microscopic ones. Thus, we are looking to find that distribution $P(\boldsymbol{\mu})$ which obeys the constraints \eqref{mi}, it is normalized $\sum_{\boldsymbol{\mu}\in \Omega} P(\boldsymbol{\mu}) = 1$ and beyond that it is the least biased, or maximally uncertain, i.e., for which entropy \eqref{entropy} is maximal. As Jaynes puts it, this is the distribution that is the least committal towards the unavailable information. Using the standard method of Lagrange multipliers we then maximize \eqref{entropy} subject to \eqref{mi} and normalization and obtain the parametric family of Gibbs distributions: \begin{equation} P(\boldsymbol{\mu};\boldsymbol{\beta}) =\frac{1}{Z(\boldsymbol{\beta})} e^{-\boldsymbol{\beta}\cdot\boldsymbol{m}(\boldsymbol{\mu})} \label{gibbs}\end{equation} where \begin{equation} Z(\boldsymbol{\beta}) = \sum_{\boldsymbol{\mu}\in \Omega} e^{-\boldsymbol{\beta}\cdot\boldsymbol{m}(\boldsymbol{\mu})} \label{pf} \end{equation} is the normalization factor, also called partition function. The $\boldsymbol{\beta} = (\beta_1,\ldots,\beta_K)$ Lagrange multipliers are then determined from solving \eqref{mi} with the form of \eqref{gibbs} replacing the $P(\boldsymbol{\mu})$ to obtain unique values $\boldsymbol{\beta} = \boldsymbol{\beta}(\overline{m}_1,\ldots,\overline{m}_K)$. This step is also called fitting because these are usually nasty equations that can only be solved numerically.

Throwing 3-sided dice

As a simple example, consider a 3-sided dice. (Take a standard dice and repeat every number 1,2,3 twice on its opposite sides, or use an artistic dice like the one by Nvenom8 shown in the figure.) Thus $\Omega = \{1,2,3\}$. Assume that after throwing the dice very many times Alice reports to Bob the average value of the upside face as $\overline{m} = 2.5$. Based on this information only, Bob wants to compute the MaxEnt probabilities $P(1)$, $P(2)$, $P(3)$ of the topside face. In this case Eq. \eqref{mi} is $P(1) + 2 P(2) + 3 P(3) = \overline{m}$ and with normalization $P(1) + P(2) + P(3) = 1$ we have one more unknowns than equations. The MaxEnt distribution will simply be $P(m; \beta) = e^{- m\beta}/Z(\beta)$, $m=1,2,3$ and $Z(\beta) = e^{-\beta} + e^{-2\beta} + e^{-3 \beta}$. Eq. \eqref{mi} becomes a quadratic equation for $e^{-\beta}$ with the only valid solution of: \begin{equation} e^{-\beta} = \frac{\overline{m}-2+\sqrt{\Delta}}{2(3-\overline{m})}\,,\;\;\Delta = -3\left(\overline{m}\right)^2+12\overline{m}-8\end{equation} leading to $P(1) = \frac{1}{6}(10-3\overline{m}-\sqrt{\Delta})$, $P(2) = \frac{1}{3}(\sqrt{\Delta}-1)$ and $P(3) = \frac{1}{6}(3\overline{m}-2-\sqrt{\Delta})$. In particular, for $\overline{m} = 2.5$ we get $P(1) \simeq 0.116$, $P(2) \simeq 0.267$ and $P(3) \simeq 0.617$, indicating the most likely biases in the dice. The plots of these three probabilities as function of $\overline{m}$ are shown below.

MaxEnt probabilities for the three-sided dice.

What if Alice reports $\overline{m} = 2$? In this case $P(1)=P(2)=P(3) = 1/3$, which are the same values as for an unbiased dice. Does this mean that the dice is fair? If the only information we have is the average topside face value, we cannot say whether the dice is biased, only that the available information does not contradict the assumption that it is a fair dice. To see this, assume that Alice reports in addition, also the average of the square of the upside face value, i.e., $\overline{m^2}$. In this case we have full information, namely, as many unknowns as independent equations and we find that $P(1) = \frac{1}{6}(6-5\overline{m}+\overline{m^2})$, $P(2) = 4 \overline{m} – \overline{m^2} – 3$ and $P(3) =\frac{1}{2}(2-3\overline{m}+\overline{m^2})$. Thus, if Alice reports that $\overline{m} = 2$ (same as it would be for a fair dice) but $\overline{m^2} = 5$ then $P(1)=P(2) = 1/2$ and $P(3) = 0$, and we actually have a heavily biased dice.

For a detailed account on the MaxEnt method, its dynamical version MaxCal and for their many applications see the nice review⁴ by Pressé et al.

The degeneracy problem

First, a disclaimer: All social networks in the following description are purely fictitious, and any resemblance to a network of specific living, dead or undead people would only be possible by a significant stretch of the imagination. Intelligence agencies have discovered that a new extremist but nefarious and intolerant ideology is secretly taking hold across the globe. From indirect evidence and intercepted, but heavily encrypted communications it was found that there are $N=9$ cells involved in nefarious activities all over the world under the aegis of this new ideology. Based on their data, the defense analysts expect that $\overline{M}=17$ pairs of cells interact regularly and that $\overline{T} = 19$ different triplets of cells have collaborated at various times to realize their attacks. However, it is not known specifically which cells are interacting with one another and which cells participated in which activity at any time. The analysts want to know the likelihood that the 9 cells form a single connected network and they also want to find the most likely network that the available information $\left(N,\overline{M},\overline{T}\right)$ is consistent with.

This problem looks ripe for the MaxEnt method. In this case the set of microstates $\Omega$ is formed by all the simple, labeled graphs $G \in \Omega$ (here $G$ replaces $\boldsymbol{\mu}$ in the above) on $N=9$ nodes. There are $|\Omega| = 2^{N(N-1)/2} = 68\,719\,476\,736$ such graphs.

The two constraints in expectation are the number of edges $\overline{M}=17$ and the number of triangles $\overline{T} = 19$. Since the number of nodes is small, in this case we can exactly enumerate (in-silico) all the simple, labeled graphs (graphs from here on) with a given number of edges and triangles and thus compute exactly all the probabilities involved in Jaynes’s method. If $\beta$ is the multiplier corresponding to the number of edges and $\gamma$ for the number of triangles, after solving \eqref{mi} for $\beta, \gamma$, one finds $\beta = 1.3017$ and $\gamma = -0.6325$ and the rather peculiar distribution shown below.

Distribution of graphs with given number of edges in the MaxEnt model for the nefarious network of cells.

It seems as if the MaxEnt method is confused and cannot make up its mind: the microstates (graphs) with the highest probabilities (two peaks in the figure) are either very sparse (few edges) or very dense (almost all connections are there). The average number of edges is certainly respected from the combination of the two types in the sum \eqref{mi}, but the individual realizations/microstates are rarely near the expected value. Moreover, in one case (sparse) the network is disconnected into pieces, whereas in the other (dense) forms a single component, and both are highly probable. But the situation is even worse: since in this case we can enumerate everything, we find that there are $64\,260$ (labeled) graphs with exactly 17 edges and 19 triangles but none of them is fully connected. Yet, the MaxEnt method tells us that there are a significant number of connected graphs! More precisely, the probability that a graph sampled by the MaxEnt method will be fully connected is 0.6. None of the connected graphs (of the 60%), however, has 17 edges and 19 triangles, so they don’t have the observed property at all. This answer is completely useless for our analysts. This effect is called degeneracy, and has been known for nearly three decades, since the works of David Strauss⁵ in social networks. Other examples of degeneracy are shown in our paper, one involving the now famous karate-club dataset by constraining the number of edges and two-stars, another involving a network of jazz musicians, where we constrain three quantities in expectation: edges, two-stars and triangles and a third example involving groups of permutations (related, e.g., to predicting outcomes of horse races).

Note that the constraining values $\overline{M},\overline{T}$ could have come from an actual realization/state of the system from $\Omega$. A degenerate distribution in this case may even predict a near zero probability for the realization from which the data was obtained!

Historically, in degenerate cases people simply have thrown away the existing data and tried choosing other types of measures and data for it. With some luck, degeneracy would be avoided. This is certainly unsatisfactory, as we might not be able to collect new data on other types of measures (as in the case of the nefarious network of cells above), or the measures used are of particular interest in the field (like triangles in sociology). We thus have to have a method that maximally exploits the data that is available to us, instead of avoiding them.

So what causes degeneracy?

To answer that, let us first make the observation from \eqref{gibbs} and \eqref{pf} that all microstates (graphs in our case) that share the same values for the measures $\boldsymbol{m}$ will all have the same MaxEnt probability. So instead of summing over the space of microstates (which is rather uncanny because it is huge and usually we know little about it), we can gather all the terms (microstates) in the sum with the same $\boldsymbol{m}$ into one group and then sum over the range of values for $\boldsymbol{m}$. In particular, the partition function \eqref{pf} becomes: \begin{equation} Z(\boldsymbol{\beta}) = \sum_{\boldsymbol{m}} {\cal N}(\boldsymbol{m})\, e^{-\boldsymbol{\beta}\cdot\boldsymbol{m}} \label{pf1} \end{equation} where ${\cal N}(\boldsymbol{m})$ is the number of microstates with given values for the measures $\boldsymbol{m}$. In physics this function is called the density of states (d.o.s) function. Thus, the averages \eqref{mi} can be expressed as \begin{equation} \overline{m}_i = \sum_{m_i = – \infty}^{\infty} m_i\,p(\boldsymbol{m};\boldsymbol{\beta})\,\;\;\;i=1,\ldots,K\,, \label{mib}\end{equation} where \begin{equation} p(\boldsymbol{m};\boldsymbol{\beta})=\frac{{\cal N}(\boldsymbol{m})}{Z(\boldsymbol{\beta})}\,e^{-\boldsymbol{\beta}\cdot\boldsymbol{m}}\,. \label{pmb}\end{equation} Note that in all the above the $\boldsymbol{m}$ appear as free variables. Eq. \eqref{pmb} expresses the probability that the MaxEnt distribution generates (or samples) a microstate with given values for measures $\boldsymbol{m}$. Thus, in degenerate cases $p(\boldsymbol{m};\boldsymbol{\beta})$ shows two or more peaks as function of $\boldsymbol{m}$ (see our paper for a more precise description). Since the exponential in \eqref{pmb} is a monotonic function of $\boldsymbol{m}$ it then follows that the culprit behind degeneracy is the d.o.s function ${\cal N}(\boldsymbol{m})$. In the paper we prove the following theorem:

Theorem: A MaxEnt model is non-degenerate if and only if ${\cal N}(\boldsymbol{m})$ is log-concave.

A function is log-concave if the log of the function, that is $\log\!\left({\cal N}(\boldsymbol{m})\right)$ is concave. The fact that we can pinpoint the cause of degeneracy is the good news. The bad news is that it is related to properties of the function ${\cal N}(\boldsymbol{m})$. It is a counting function, giving the number of objects in some space that all share the same property, for example in our case the number of simple graphs on $N$ nodes with a given number of edges and a given number of triangles (precisely that many, not on average!). This is a purely mathematical/combinatorial object and has nothing to do with MaxEnt, probabilities or inference, however, its properties determine whether the MaxEnt method will show degeneracy. The bad news is that counting problems are among the hardest problems in discrete mathematics. Even those problems whose decision version is easy (they are in P), their counting version can be very hard (from #P-complete). For example, deciding that there is perfect matching in a bipartite graph is easy (has a polynomial-time algorithm) but counting their number is very hard (no such algorithm).

The d.o.s. function on 9 nodes (color intensity) with given number of edges and triangles. The cross hair is at 17 edges and 19 triangles.

For a function ${\cal N}(\boldsymbol{m})$ to be log-concave, two conditions need to be fulfilled. On one hand the domain ${\cal D}$ of ${\cal N}(\boldsymbol{m})$ has to be geometrically convex and on the other, it has to fulfill the Prékopa-Leindler inequality (which is just the regular definition for concavity for the logarithm of a function). In many cases the first condition is broken, such as in our example of the nefarious network with ${\cal N}(\boldsymbol{m}) = {\cal N}(m_|,m_{\vartriangle})$ illustrated in the Figure to the right. Since the domain is curved, waxing-crescent shaped, the domain is non-convex (a straight segment with end-points close to the tips will have most of its points outside the domain). The next figure then shows in this 2D map the bimodal distribution of graphs $p(\boldsymbol{m};\boldsymbol{\beta})$ as sampled by the MaxEnt method.

The MaxEnt model samples the graphs according to a bimodal distribution (peaks are at darker blue regions).

The non-convexity of ${\cal D}$ comes from the fact that the variables, i.e., the components of $\boldsymbol{m}$ are not independent, more precisely, they are non-linearly dependent from one another. In classical statistical mechanics the constraining variables (in expectation) are independent quantities characterizing different types of interactions with the environment such as pressure (mechanical interactions), chemical potential (chemical or particle exchange), temperature (thermal), magnetic susceptibility (magnetic), etc. In this case the domain of the d.o.s. function is always convex (a hyperrectangle) and degeneracy does not occur. Note that if the variables are linearly dependent, the domain is still convex and degeneracy does not occur in this case either (assuming the concavity condition for the log of the d.o.s. is valid).

Can we eliminate degeneracy?

Assume that we don’t have the possibility to switch to a different set of variables, but instead we have to work with the available measures and the data for them. Still using the same data, and without losing information we would like to make good predictions based on the MaxEnt approach. Let us call our original (degenerate) model (or ensemble) the MaxEnt($\boldsymbol{m}$) model (ensemble). The idea is to introduce a one-to-one mapping (so no information is lost) from the original variables (and the data for them) to new ones $\boldsymbol{m} \leftrightarrow \boldsymbol{\xi} = \boldsymbol{F}(\boldsymbol{m})$ such that the corresponding d.o.s. function (number of microstates with given $\boldsymbol{\xi}$), ${\cal N}’\left(\boldsymbol{\xi}\right)$ is now log-concave and thus the new model (ensemble) MaxEnt($\boldsymbol{\xi}$) is non-degenerate. Since $\boldsymbol{F}$ is bijective, the number of states with $\boldsymbol{\xi}$ is the same as the number of states with $\boldsymbol{m} = \boldsymbol{F}^{-1}\left( \boldsymbol{\xi}\right)$, i.e., ${\cal N}’\left(\boldsymbol{\xi}\right) = {\cal N}\left(\boldsymbol{F}^{-1}\left( \boldsymbol{\xi}\right)\right)$. The observation behind our idea is that while a function could be concave (or convex), its composition with another function, in the new variables can have altered convexity properties. Note that such mappings do not require the collection of new data! One simply applies the transformations on the data we have.

While we do not have a general method for finding such transformations, in concrete situations they can be done relatively easily. A simple example of degeneracy in our paper is the case when the constraining variables are the number of edges $m_{|}$, and the number of wedges (2-stars) $m_{\vee}$. Since on average the latter is proportional to the square of the former, we may choose $\xi_{|} = {m_{|}}^{2}$ and $\xi_{\vee} = m_{\vee}\ $(the choice is not unique, see our paper). In our nefarious network example, we may choose $\xi_{|} = {m_{|}}^{3}$ and $\xi_{\vartriangle} = m_{\vartriangle}$. In all these cases the new model $\text{MaxEnt}(\boldsymbol{\xi})$ will be non-degenerate and based on the same input information with ${\overline{\xi}}_{i} = F_{i}\left( {\overline{m}}_{1},\dots,{\overline{m}}_{K} \right)$, $i = 1,\dots,K$ (chosen that way). Since $\text{MaxEnt}(\boldsymbol{\xi})$ is a new model, we may ask how far are the averages of $m_{i}$ within the new $\text{MaxEnt}(\boldsymbol{\xi})$ ensemble from the original data values ${\overline{m}}_{i}$. Since, for example, instead of requiring that the number of edges on average is 10, we now require that the square of the number of edges is 100 on average—we do not expect this to cause differences that are too big. Indeed, we can show that due to the fact that the new ensemble $\text{MaxEnt}(\boldsymbol{\xi})\ $ is concentrated around $\overline{\boldsymbol{\xi}}$ , the differences are actually rather small (see the last paragraph on pg 3 of our published paper¹), and the new model still respects \eqref{mi} with good approximation. The figure below shows the corrected MaxEnt distribution by edge-number (black) for the case of the nefarious network.

In the corrected model, $\text{MaxEnt}(\boldsymbol{\xi})$, typical samples are drawn from near the original data values.

Indeed, the samples are now coming with high probability from around the observation value of 17. The red shows the probability that a MaxEnt sampled graph with a given edge number will be disconnected and the green that it will be connected; now the MaxEnt predicts that 94% of sampled graphs will be disconnected, much better than previously. The next figure shows the MaxEnt distribution in 2D, in the corrected $\text{MaxEnt}(\boldsymbol{\xi})$ model; it is concentrated around the observation values. The plot below is in terms of the original variables; in terms of the new ones it would have a convex shape, see our paper.

The corrected $\text{MaxEnt}(\boldsymbol{\xi})$ distribution as function of the complete set of variables.

Discussion

We have shown that the root cause of degeneracy in MaxEnt models is the existence of non-linear constraining relationships between the variables. One can eliminate degeneracy by using a one-to-one nonlinear mapping to a set of new variables (using the same data), and defining a MaxEnt model based on these new variables. The new MaxEnt model will be generating samples with high probability from the neighborhood of microstates for which the given constraints are typical, and the mean values of the original variables in this new model will be close to the original input data.

A few comments on MaxEnt models are in order, notwithstanding degeneracy issues.

A. The input data may be applied within a frequentist or a Bayesian framework. The data may come from several repeat observations and the averages represent actual averages of those observations (frequentist). Or, based on our best knowledge we may believe that we can interpret the single dataset available to us as representing typical values and thus we choose to treat them as expected values (Bayesian). In the latter case we will actually generate a MaxEnt ensemble within which the input values are indeed typical. However, if the data came from an atypical or special state of the system, then our predictions will be off when compared against new empirical data for those predictions. When that happens we know that our view of the system is off and thus we can use the MaxEnt as a tool to help update our understanding of a complex system — more on this in the next point.

B. Given a set of constraining measures $\boldsymbol{m}$ we sometimes want to predict the value (in expectation) of another measure (such as the number of triangles knowing the number of edges and 2-stars), or even its distribution. This can only be done to the extent that the constraining measures determine the variable of interest. However, if the variable of interest is independent from the constraining ones, the MaxEnt model will be useless for predicting that variable, because the information we have says nothing about what we are interested in. Jaynes makes the observation that this aspect is actually a strength of the MaxEnt method: if our predictions are way off for a quantity of interest when compared against empirical data, it means that we have not included the relevant variables into our model for that quantity. Unfortunately, this does not say what are the relevant variables, just that the ones used are irrelevant.

C. The output of the MaxEnt model is the distribution \eqref{gibbs} and we have to sample microstates from $\Omega$ according to this distribution. This, in itself, is a thorny issue and one typically does it using a Markov Chain Monte Carlo algorithm. As a matter of fact this step is a computational bottleneck that limits MaxEnt applications to relatively small sets of data.

D. One of the original criticisms of Jaynes’s method was its subjective (Bayesian) nature. Namely, physical reality does not depend on our state of knowledge and thus it brought into question the method that maximizing the information entropy (which expresses the uncertainty in *our* knowledge about a system) can be used to generate predictions about physical reality. However, since then several authors, starting with Shore and Johnson have demonstrated that any algorithm that draws inferences about probability distributions in a self-consistent and unbiased way will be maximizing the entropy function \eqref{entropy}—see the review⁴ by Pressé et al.

Open problems

1. Is there a general method/algorithm that computes/generates the mapping $\boldsymbol{F}$ when $\mathcal{D}$ is simply connected such that in the new variables the counting function is log-concave? In 2D, we know from the Riemann mapping theorem that there is a conformal mapping that takes any simply connected domain into the unit disk in the complex plane. What about higher dimensions? Can we define $\boldsymbol{F}$ as a computable flow that maps a non-convex $\mathcal{D}$ into a convex $\mathcal{D’}$?

2. Speeding up the sampling (MCMC) process of the microstates according to the MaxEnt distribution \eqref{gibbs}.

3. Another computational difficulty is solving for the Lagrangean multipliers in \eqref{gibbs} (or \eqref{mib}), because in general we do not know explicitly the counting function. This is usually done approximately, with various stochastic root-finding methods. Improvements would be dearly needed here.

References

1. Sz. Horvát, É. Czabarka and Z. Toroczkai, Phys. Rev. Lett. 114, 158701 (2015). arXiv:1407.0991.
2. E. T. Jaynes, Phys. Rev. 106, 620 (1957).
3. E. T. Jaynes, Phys. Rev. 108, 171 (1957).
4. S. Pressé, K. Ghosh, J. Lee and K. A. Dill, Rev. Mod. Phys. 85, 1115 (2013)
5. D. Strauss, SIAM Rev. 28(4), 513 (1986)

This work was supported in part from Grant No. FA9550-12-1-0405 from the U.S. Air Force Office of Scientific Research (AFOSR) and the Defense Advanced Research Projects Agency (DARPA) and, additionally, in part by the Defense Threat Reduction Agency (DTRA) under Grant No. HDTRA
1-09-1-0039.

The Cost of Peer Review

Zoltan — Wed, 15 Apr 2015 14:02:40 +0000

(Update added April 20, 2015, see the end of this post.)

I have been procrastinating for while about setting up this blog, but recent events have pushed me over the bump, so here it is. In the past week about 145 editors have resigned (including yours truly) from the editorial membership board of the journal Scientific Reports, for a cause supported now by over 550 signed supporters and counting. The editorial membership board of SciRep has thousands of members. Sorry for the long post, but the issues are serious.

On March 24, 2015, Scientific Reports (SciRep from here on), an open access journal owned by NPG/Palgrave Macmillan has sent out a letter to its editorial board members, alerting us about an experiment in peer review they were conducting. See here for additional info. In a nutshell, SciRep has launched a small-scale experiment, only in biology, which would secure fast peer review for those authors who are willing to pay for it. It guarantees to return the reviews to the authors within three weeks, or their money back. To accomplish this, SciRep employs the services of a 3rd party, Research Square, using their peer review system Rubriq. More details are revealed here. Currently, fast-track service seems to be set at \$750 a pop which would be in addition to the \$1,495 fee for publishing the paper. It is not clear how much of that goes to the reviewers, I suspect a small fraction (\$100 ?) as the rest goes to the for-profit parts of the setup.

This has led to an outcry from the part of the editors, for reasons outlined in an open letter that can be found here: http://www.peerreviewneutrality.org/ . This letter is also a letter of resignation by those editors, effective immediately. If you are an editor for this journal, please read this letter and consider the reasons for it and the arguments below; hopefully you’ll sign it too. Even if you are not an editor, as a scientist, please consider supporting the cause. As I will try to outline below, the issues are larger than this particular problem, and we all need to be aware of them.

At first, this does not seem to be a bad idea: paying a reviewer for her/his work. Refereeing is hard and time-consuming and we do it for free because we feel it is our responsibility to keep up the quality of science. However, the SciRep trial is about something else: a referee gets paid if she/he returns the report within a short time frame (3 weeks) and only for those authors who opt to pay for it! And, for now, only within biology. Here are only two of the issues that this pay-for fast-track review introduces, others are raised in the open letter: 1) Since the monetary incentive is visibly only tied to the return time, this will put a pressure on referees that may lead to erroneous or superficial reviews. I frankly would want to have a better quality review than a non-informative fast one. Good quality review takes time because research is hard and good research is even harder, and thus judging difficult things will be highly non-trivial. It often takes years before someone is able to make progress on a difficult problem. Are we saying that it can be reviewed for correctness within 3 weeks by someone who has several other papers to write, referee, handle editorially, teach classes, mentor students, write grants and write reports? So maybe the goal is not to make sure that reviews are good, only to be fast; you might wonder why. 2) This sets up a two-tier system. The Haves will get quick and superficial reviews and publish before the Havenots. Just to be the Devil’s advocate, consider the extreme case of a conference where someone talks about their latest results that have just been submitted and are under review. Hypothetically, someone else at the conference calls up their students to use their own data and simulations and publish the same idea before the original author, using pay-for fast-track. The key word here is publish (because people can always post to archives). This incentive system, in general, will create advantages and advances for some people based not on scientific excellence, but on other criteria (on funding availability). It will definitively disadvantage the younger generation that normally has less funding; or folks from poorer countries.

The journal has responded to the questions raised in the (now) open letter, the response can be read here. To questions about the qualifications of the peer review coordinators at Rubriq the reply was a brief “minimum a PhD degree in biomedical sciences”. The answer was clearly unsatisfactory and the letter went open. There is wide range of variation in expertise within scientists with a PhD. The scientific hierarchy is very large, ranging from the freshly graduated, still green-eared PhDs (no offense, we all were there at some point) to members of the National Academy of Sciences, Nobel Prize winners, etc. Why not try to enlist the biggest and most respected names in a research area who genuinely care about the dissemination of good science? Why eliminate expert based peer-review? A fresh PhD does not make someone an expert, believe it or not; it is only the beginning of expertise. Why not pay referees from the overall profits of the journal, uniformly or by some commonly agreed (with expert-editors) scheme that does not discriminate based on criterias other than excellence? The answer lies within the bigger picture.

The bigger picture

Publishing is a big business and it operates as big businesses do. Big mergers generate bigger sticks for their owners to help increase their control of the market, and of course, reduce competition. See this report on a recent “consolidation” step between Macmillan Science and Education and Springer Science+Business Media. The article also reports on the five largest publishers, the number of journals they own and some estimates of their revenues: “Elsevier reported £2.1 billion revenues in 2013 (€2.7 billion), while its parent company, Reed Elsevier, made £6.6 billion, but only £1.1 billion of that is from science journals, Aspesi estimates.” Once there is little or weak competition, the prices will go up, simply because they can. In old times the costs of publishing were real, as someone had to actually typeset the article. New technologies, including the Internet have made the real costs of publishing drop drastically and thus one would have expected a drop in publishing costs as well. Instead, as Donald Knuth nicely shows it in an open letter written 12 years ago, the per-page costs of published material has increased (doubled by 2003)!

Publishing is also based on “sound” business fundamentals: someone does the hard part of the work for free while others reap its benefits. In particular, as a scientist you invest serious amounts of time and effort to generate an intellectual property. To disseminate your results, you typeset your own paper using typesetting software following the guidelines imposed by the journal; you also generate figures and art-work to make your paper visually more appealing, again following journal guidelines; you are also often asked to write pilot texts and additional material to better advertise your own work through the journal and following journal guidelines; then you spend significant effort on reviewing similar type of work by other scientists, following journal guidelines; you also act as an editor having the responsibility to choose the referees for the paper that you are handling for the journal, again following the guidelines. And you do all of this for free! Yep. Pro bono, Gratis, Free. If this does not sound ridiculous to you I invite you to read this writing by the one and only Scott Aaronson. Yeah, folks, as Scott puts it: it is time to realize that you’d been had.

To increase their profits, big publishers have simply increased subscription prices to the point that university libraries could no longer fork them out. They have also resorted to various sneaky tricks from bundling to legal bullying as described in a post by John Baez here. The community of scientists then formed several organizations to combat this, such as the Budapest Open Access Initiative. This is how open access journals have come into existence, such as PLoS journals and SciRep to name a few. You’d think that this resolves the problem… Nah.

So let’s see. Imagine you are an ambitious, energetic fellow with a strong desire for eternal riches, at the helm of an open access journal. An open access journal is nothing but a queuing system: a manuscript pops in, it is processed for a time $T$, then it pops out as a publication; you collect a fee, with a net profit of $X$ per paper. Great, after $N$ papers in a given period of time, let’s say a year, you collect a $P=NX$ profit. Now what could you do to increase $P$? You can raise the fee per paper but that is already high enough so that a further increase would limit the number of submissions (would also be more expensive than the competitors). You need to grow $N$. Assuming a plentiful demand/submission, your only bottleneck is $T$. What goes into the time lag $T$? At least two things: a) initial format check and b) peer review time. In the beginning, your staff could handle a) efficiently, the turn-around time was a few days. But since the submissions have sky-rocketed, it went up to few weeks for every submission and re-submission. You’d need to hire more staff, but that would reduce your profits! A fellow editor Gyorgy Korniss and others have already noticed this increase in quality check times. Clearly, b) is the option to change. But how can you reduce the refereeing time? For one, make it easier for a paper to go through: reduce the criteria for rejections. Among these, the most important one is technical correctness, so keep that one, remove all others. It does not matter, if it is ultimately an irrelevant and technically insignificant result, as long as it is correct. I got very surprised the first time when the journal pushed back on me for rejecting papers written at a graduate student homework level. Other colleagues had the same experience. Next, in your standard letter to reviewers you could request that the referee returns his/her reports within one week. Of course, that is unreasonable and will not work. What else? Light-bulb moment: queues are serial, we need to parallelize! Take the topic in which you received the most submissions, and outsource it to some 3rd party. But if you were to pay for that from the existing income you might be lowering your profits. Another light-bulb: outsource only those papers that the authors are willing to pay for an extra fee, yet still collect the regular $X$ (at least, it depends on the agreement with the 3rd party). Isn’t this a great scheme ?

If you look back at history, in the long term, nothing has generated more value than scientific research, and technology. No-one can pay for it, really. It saves lives, increases life expectancy, and one day it might even take us to other planets. A little respect, please.

Update, April 20, 2015

SciRep responded today in an email to the editors who have signed the open letter. In a nutshell, the letter acknowledges the issues as “philosophical issues” potentially brought on by the fast-track peer-review service. It expresses their regret that they have not discussed their plans with us in advance. It also gives a summary of the outcome of the trial, acknowledging that “for some communities speed matters and that there is a demand for a service of this type”. The letter then concludes with a paragraph filled with reiterations of commitment towards scientists and science advancement.

There is no statement in the response letter on whether the journal will proceed with implementing the fast-track peer-review service; only that this experiment was not intended to be scalable. However, if they are focused on increasing their profits then I think it is likely that this service will be implemented in some shape or form. Both the original announcement of the trial and the most recent response letter by the journal builds around the argument that there might be a demand for such a service. Of course there would be a demand for such service! Just as there would be a demand for publishing without a review, or at most after a quality/language check! And there is no need to do that experiment. It has been done and it is called the arXiv. With more than a million submissions by the end of 2014 and over 8000 monthly submissions it demonstrates the baseline for need of scientific publishing (papers submitted to the arxiv are not peer reviewed, only moderated).

The response letter does not address any of the potential issues raised by a fast-track service if it were to be implemented (see here as well). One can add yet another problem: fast-track service may dis-incentivize some reviewers who work on a non fast-track paper: why would someone generate a review within 3 weeks, whereas those who work on a fast-track paper will be paid for the same work, and this reviewer will not? This might actually slow down the journal’s throughput on the regular articles, lowering their profits. While some of these problems could be addressed, the biggest one to me is that the payments would be tied to the speediness of the service, not to the quality of the work by the referees or by the authors. Such services cannot improve on the quality of published science, only dilute it.

For the incident described in this blog and also based on my recent and past experiences with this journal I decided to withdraw my support from it. I will not serve nor as an editor nor as a reviewer for SciRep, at least not until it improves its practices to better serve the interests of the scientific community.