When we want to combine several estimates of a physical quantity into a single, more precise one that accounts for all inputs in the correct way, we routinely rely on the weighted average: we take the N independent, Gaussian-distributed measurements x_i +- σ_i (i=1,...,N) and compute from them the weighted average and its associated uncertainty. The recipe is simply the result of applying the method of least squares or, equivalently, of a likelihood maximization.
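Explicitly, with the sums running over i = 1,...,N, the two formulas are

x_wa = [ Σ_i x_i/σ_i^2 ] / [ Σ_i 1/σ_i^2 ],     σ_wa = sqrt[ 1 / Σ_i (1/σ_i^2) ].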
It turns out that in many cases we are confronted with estimates of the same physical quantity that appear to be incompatible with each other: say 10+-4 and 30+-5. This happens more often than random fluctuations of the measurements toward the tails of their probability density functions would warrant, typically because the measurements are affected by unknown biases of systematic nature. A simple way to state this is that the uncertainties are underestimated. (The discussion of the distribution of systematic effects, which can be strongly non-Gaussian, is better left for another time!)
If we take the weighted average we do get a result in between the two, and closer to the determination with the smaller uncertainty, as our intuition would want it: x_wa = (10/16+30/25)/(1/16+1/25) = 17.80. However, the error on the average is σ_wa = sqrt[1/(1/16+1/25)] = 3.12, and so we are in the embarrassing situation that our result, 17.8+-3.1, is very far from either of the two inputs.
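For those who like to verify numbers with code, here is a minimal Python sketch that reproduces the figures above (the inputs are the two hypothetical measurements of the example):

import math

x = [10.0, 30.0]                    # central values of the example
sigma = [4.0, 5.0]                  # their quoted uncertainties
w = [1.0 / s**2 for s in sigma]     # weights = inverse variances

x_wa = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
sigma_wa = math.sqrt(1.0 / sum(w))
print(f"weighted average: {x_wa:.2f} +- {sigma_wa:.2f}")   # about 17.80 +- 3.12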
I have described elsewhere the way the Particle Data Group computes the error on a weighted average of N independent, Gaussian-distributed estimates of the same physical quantity. In short, what the PDG does when confronted with the problem of obtaining a meaningful error for the weighted average of incompatible determinations is to "scale up" the error of each independent estimate by the same factor S, whose square is the reduced chi-squared of the N determinations: S^2 = χ^2/(N-1).
The symbol χ^2, for those of you who are not too familiar with basic statistics, is just the sum of the squared differences between the average and each input, each divided by the corresponding variance (see below). When divided by N-1, the χ^2 (then called "reduced chi-squared") should be close to unity. A value much larger than unity indicates incompatibility of the inputs.
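In formulas,

χ^2 = Σ_i (x_i - x_wa)^2 / σ_i^2,     S^2 = χ^2/(N-1),

where x_wa is the weighted average defined above and the sum runs over the N inputs.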
So the PDG recipe, in a nutshell, is to replace the weighted-average error σ_wa with a scaled version, σ' = S*σ_wa, obtained by replacing each error σ_i with its scaled version S*σ_i. The introduction of the "Review of Particle Properties" (see page 14 of the 1500-page volume) explains how this is a democratic way of handling the situation: being unable to decide which, among a set of N independent determinations, is the likely cause of the large χ^2 (read: of the incompatibility of the input measurements), the most sensible choice is to scale the errors of all of them by the same amount.
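To make the recipe concrete, here is a short Python sketch applying it to the same two hypothetical inputs used above; the snippet simply implements the S^2 = χ^2/(N-1) prescription, and the variable names are mine:

import math

x = [10.0, 30.0]
sigma = [4.0, 5.0]
w = [1.0 / s**2 for s in sigma]

x_wa = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)        # weighted average
sigma_wa = math.sqrt(1.0 / sum(w))                          # its default error

chi2 = sum((xi - x_wa)**2 / s**2 for xi, s in zip(x, sigma))
S = math.sqrt(chi2 / (len(x) - 1))                          # PDG scale factor
print(f"chi2 = {chi2:.2f}, S = {S:.2f}, scaled error = {S * sigma_wa:.2f}")

For these inputs one finds S of about 3.1, so the quoted result becomes roughly 17.8 +- 9.8, an uncertainty that now reflects the spread of the two determinations.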
Note that under the PDG rescaling the weighted average itself remains unchanged, since all weights, which are inverse variances, get multiplied by the same factor 1/S^2. This is a desirable feature of the method, of course: it only purports to address the non-conservative nature of the default weighted-average error, without making any assumptions.
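Spelling this out, with the scaled errors S*σ_i one gets

x'_wa = [ Σ_i x_i/(S^2 σ_i^2) ] / [ Σ_i 1/(S^2 σ_i^2) ] = x_wa,     σ'_wa = sqrt[ 1 / Σ_i 1/(S^2 σ_i^2) ] = S*σ_wa:

the common factor 1/S^2 cancels in the ratio defining the central value, while the error picks up exactly one factor of S.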
But is it really the case that the PDG recipe is a "know nothing" approach to the problem? I argue otherwise. To see why, let us consider the likelihood function of the N determinations: before the application of a scale factor, we would write this as
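L(μ) ∝ exp[ -(1/2) Σ_i (x_i - μ)^2 / σ_i^2 ]

(this is the standard Gaussian form, written here up to normalization factors that do not depend on μ; the symbol μ denotes the true value of the quantity being estimated, and the sum runs over the N determinations).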
Now, if we replace all variances σ_i^2 with their scaled versions S^2*σ_i^2, we get the modified likelihood
Note that in this second expression I have inserted a factor k(S) which, for reasons that will become clear shortly, I will refer to as the "prior for S". If we now take the logarithm of the likelihood, we obtain the following expression:
[ We are always allowed to work with the logarithm of the likelihood in place of the likelihood itself, since the logarithm is a monotonic function: whatever values of the parameters maximize log(L) also maximize L! ]
Now, in order to find the maximum likelihood estimate of a parameter of interest, one sets to zero the derivative of the (log-)likelihood with respect to it. Taking the derivative with respect to S of the above expression and equating the result to zero thus determines the value of S which maximizes the likelihood. In so doing we obtain:
Aha. Observe now that the equation above, thought of as an equation for the unknown S, depends only on the sum of the squared deviations (in variance units), which is nothing but the χ^2 of the N determinations, and on the mysterious function k(S) and its derivative k'(S). So, rather than solving for S, we can turn the logic around and take k(S) as the unknown: if we demand that the PDG choice S^2 = χ^2/(N-1) be the solution, what remains is a differential equation for k(S), which we may write as
It turns out that this equation is quite easy to solve (unlike differential equations in general!). Indeed, we only need to remember the rule of differentiation of the power function, d(x^α)/dx = α*x^(α-1). The solution of the equation is just k(S) = 1/S!
Now let us stop and think about what we have found. We have found that S = [χ^2/(N-1)]^(1/2) is the maximum likelihood estimate (MLE) of a common scale factor for the errors of our N Gaussian inputs only if the likelihood includes a prior distribution k(S) = 1/S for the scale factor itself. Or, if you do not want to speak about priors: we need to inject into the problem the notion that S is distributed with a k(S) = 1/S PDF. Had we not included k(S) in the original likelihood expression, we would have found that the PDG choice of S is not the MLE!
The above may not look like much, but it does mean that the PDG prescription is justified, as a fix to the problem of inconsistent estimates, only if one believes that S is distributed as quoted; that is, a priori, before any data is there to be averaged, one believes that the errors are more likely to need scaling by a smaller, rather than a larger, common factor; and that belief is expressed in a quantitative form (for instance, it is exactly as likely that S will be found between 1 and 2 as it is that it will be found between 3 and 6).
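Indeed, a density proportional to 1/S assigns to an interval a probability that depends only on the ratio of its endpoints:

∫ dS/S from 1 to 2 = ln 2 = ∫ dS/S from 3 to 6,

so intervals with the same endpoint ratio (here 2/1 = 6/3) are equally probable under this prior.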
I think this is a remarkable observation, as it clarifies that there are alternatives to the PDG way of handling the problem, based on different (but equally reasonable) prior beliefs about the distribution of S!