It has generally been noted that the multi-model mean in AR4 outperforms the individual members of the ensemble. In this post, I’ll look at this specifically within the context of the 20th century hindcast of surface temperatures, to see to what degree it holds and why that might be. After all, the following figure appears twice in the IPCC AR4 report, in chapters 8 and 9:
Like my last post, this one gives me a chance to dive more into the CMIP3 models and work on my time series programming in R. Once again, I’ll be looking at the 54 ensemble members for SRES A1B that I got from Climate Explorer. The code and data can all be downloaded here. The R file itself will read the HadCRUT data over the internet. I was planning on including the ability to use additional surface temperature datasets for the 20th century, at least GISS, but the format is a bit more difficult to parse. I may need to borrow one of Kelly O’Day’s scripts to help with that processing in the future.
What is “better”?
Not being a stats guy (and I could be way off here), I wanted to use a couple of simple methods to determine which hindcasts might be considered the “best” or “closest” to actual observations. I chose two metrics: 1) the RMSE after baselining on the 1900-1950 average, and 2) the correlation between diff(obs) and diff(model_run). While 1 and 2 might not be totally independent, getting a high score in #1 does not guarantee a high score in #2, and vice versa. There’s also a question of what length of time we should be looking at…obviously, anything less than 1-year averages is suspect, but using a 30-year average only leaves us with 3 (independent) observations.
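To make the two metrics concrete, here is a minimal sketch in Python (not the R script linked above). The function names are my own for illustration, and treating the 1900-1950 baseline as inclusive of both end years is an assumption.

```python
import math

def baseline(series, years, start=1900, end=1950):
    """Subtract the mean over start..end (assumed inclusive) from every value."""
    ref = [v for y, v in zip(years, series) if start <= y <= end]
    offset = sum(ref) / len(ref)
    return [v - offset for v in series]

def rmse(a, b):
    """Root mean square error between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def diff(series):
    """First differences, like R's diff(): x[t+1] - x[t]."""
    return [b - a for a, b in zip(series, series[1:])]

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length series."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def diff_correlation(obs, run):
    """Metric 2: correlation between the differenced series."""
    return pearson(diff(obs), diff(run))

# Toy check: a run that equals the observations plus a constant offset
# scores perfectly on both metrics once it is baselined.
years = list(range(1900, 2000))
obs = [0.005 * (y - 1900) + 0.1 * math.sin(y) for y in years]
run = [v + 0.3 for v in obs]
metric_1 = rmse(baseline(obs, years), baseline(run, years))
metric_2 = diff_correlation(obs, run)
```

Differencing removes any constant offset and turns a linear trend into a constant, which is why the second metric responds to the up and down fluctuations rather than to the overall warming.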
Root Mean Square Error
The following histogram shows the RMSE for 1-year average temperatures (basically what we see in the IPCC figure above):
By this metric, the multi-model mean does perform the best. The “linear trend” value refers to the RMSE of a linear trend line fit to the 20th century temperature, and the “flat line” value refers to the RMSE of a flat line at the 1900-1950 average value. That the multi-model mean does better than the simple linear trend seems to suggest that it gets some of the up and down fluctuations correct (mostly around volcanic eruptions). The fact that a simple flat line does not look worse than some of the runs is somewhat surprising, and does not speak too highly of some of those runs.
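The two reference values can be sketched as follows. This is a Python illustration under assumptions, not the code from the linked R file: I assume the observations are already expressed as anomalies from the 1900-1950 mean (so the flat line is simply zero), and that the trend line is an ordinary least-squares fit compared directly against those anomalies. The function names are hypothetical.

```python
import math

def rmse(a, b):
    """Root mean square error between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def linear_fit(x, y):
    """Ordinary least-squares slope and intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

def reference_rmses(years, obs_anom):
    """RMSE of a fitted trend line and of a flat zero line.

    obs_anom is assumed to be baselined already, so the flat line
    at the baseline-period average is all zeros.
    """
    slope, intercept = linear_fit(years, obs_anom)
    trend = [slope * y + intercept for y in years]
    flat = [0.0] * len(years)
    return rmse(trend, obs_anom), rmse(flat, obs_anom)

# Toy example: perfectly linear anomalies centred on the baseline period.
years = list(range(1900, 2000))
obs_anom = [0.01 * (y - 1925) for y in years]
trend_rmse, flat_rmse = reference_rmses(years, obs_anom)
```

Under these assumptions the fitted trend can never score worse than the flat line, because the flat line is itself one of the candidate lines that the least-squares fit chooses among. A model run has no such guarantee, which is how a run can end up worse than the flat line.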
What happens if we look at 10 year averages (limiting us to only 10 points per run)?
Once again, the multi-model mean performs pretty well compared to the total set of runs, and better than the linear trend, but technically not the “best” (although it is possibly within the error bars based on uncertainty in the HadCRUT dataset).
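For reference, the 10-year averaging can be sketched like this in Python. I am assuming non-overlapping blocks with any incomplete trailing block dropped, and the function name is hypothetical.

```python
def block_means(series, width=10):
    """Average a series over consecutive, non-overlapping blocks.

    Assumption: a trailing block shorter than `width` is discarded.
    """
    n_blocks = len(series) // width
    return [sum(series[i * width:(i + 1) * width]) / width
            for i in range(n_blocks)]

# 100 annual values collapse to 10 decadal values, or 3 values at 30 years.
annual = [float(i) for i in range(100)]
decadal = block_means(annual, width=10)
tridecadal = block_means(annual, width=30)
```

The same RMSE and differenced-correlation metrics are then computed on the shortened series, which is why the sample size drops so quickly as the averaging window grows.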
Correlation of Diffs
What about if we try to capture the “forced” fluctuations, minus the trend? For 1 year, the correlations look something like this:
The multi-model mean performs better than most of the individual runs, although this may be purely by luck, as none of the runs do a particularly good job of guessing year-to-year fluctuations apart from the trend. Of course, they are not meant to: weather noise at these short intervals is supposed to swamp any forced changes.
Instead, a better metric for the intended use might be the 10-year fluctuations:
In this case, the multi-model mean appears to be in the center of the pack. What’s more, the R values are not particularly impressive, given the low DF, even for determining the difference between 10 years (minus the overall trend). More on this later.
Why does the multi-model mean perform better (at least in terms of RMSE) than individual models?
As I mentioned, I’d often heard it was the case that the MMM outperformed individual models, but I wasn’t sure why this should necessarily be the case. Along the way I stumbled across this post from Science of Doom, where he digs up the following quote from IPCC chapter 8:
Why the multi-model mean field turns out to be closer to the observed than the fields in any of the individual models is the subject of ongoing research; a superficial explanation is that at each location and for each month, the model estimates tend to scatter around the correct value (more or less symmetrically), with no single model consistently closest to the observations. This, however, does not explain why the results should scatter in this way.
In other words, the explanation is not clear. One thing we might consider is that there is a forced temperature response in the models, but that each of the runs has different “weather noise” getting in the way, and that the multi-model mean, with the averaging cancelling out most of the noise, settles pretty close to the forced response. The observations themselves, being simply another combination of this forced response + “weather noise”, will not have their noise match up well with the noise from any individual run, and so the RMSE will be lower when compared to this multi-model mean than to any individual run.
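A back-of-the-envelope version of this argument, under idealized assumptions that are mine rather than anything established above: every run and the observations share the same forced response f(t), and each has zero-mean noise with the same variance σ² that is independent between series (it may still be correlated in time).

```latex
\begin{aligned}
x_i(t) &= f(t) + \varepsilon_i(t), \quad i = 1,\dots,N, \qquad y(t) = f(t) + \varepsilon_0(t), \\
\mathbb{E}\big[(x_i(t) - y(t))^2\big] &= \mathbb{E}\big[(\varepsilon_i - \varepsilon_0)^2\big] = 2\sigma^2, \\
\mathbb{E}\big[(\bar{x}(t) - y(t))^2\big] &= \mathbb{E}\Big[\Big(\tfrac{1}{N}\sum_{i=1}^{N}\varepsilon_i - \varepsilon_0\Big)^2\Big] = \sigma^2\Big(1 + \tfrac{1}{N}\Big).
\end{aligned}
```

With N = 54, the square root of the ratio of these expected squared errors is sqrt((1 + 1/54) / 2) ≈ 0.71, so in this idealized picture the multi-model mean would be expected to have roughly 30% lower RMSE than a typical individual run. Real models do not share a single forced response or a single noise variance, so this is only a rough guide.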
I simulated this scenario by generating various time series consisting of red noise + sine wave + linear trend. The first 54 runs I consider the “model”, the 55th I consider the “observations”, and then the average of the first 54 is the “multi-model mean”. One example of the result is shown below (using the same coloring as the IPCC figure):
It looks somewhat similar to the actual hindcast, with the multi-model mean showing considerably less variation than either the observations or the individual model runs. In almost all my runs, the fake MMM had a lower RMSE than any of the individual fake models compared against the fake observations. But the R value of the DIFF’d series was scattered randomly around the list, not generally performing better or worse than any of the individual runs. Both of these properties seem comparable to the actual comparisons of models vs observations vs. the multi-model mean.
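A minimal Python sketch of this kind of experiment follows. It is not the R code from the download: the AR(1) coefficient, noise level, trend, and sine amplitude and period are all illustrative assumptions, and I skip the baselining step because every fake series shares the same forced signal.

```python
import math
import random

def ar1_noise(n, phi, sigma, rng):
    """Red (AR(1)) noise: e[t] = phi * e[t-1] + white noise with s.d. sigma."""
    out = []
    # Draw the first value from the stationary distribution.
    prev = rng.gauss(0.0, sigma / math.sqrt(1.0 - phi ** 2))
    for _ in range(n):
        out.append(prev)
        prev = phi * prev + rng.gauss(0.0, sigma)
    return out

def forced_signal(n, trend=0.006, amp=0.1, period=60.0):
    """Linear trend plus sine wave (parameter values are assumptions)."""
    return [trend * t + amp * math.sin(2 * math.pi * t / period)
            for t in range(n)]

def rmse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def one_trial(rng, n_runs=54, n_years=100, phi=0.5, sigma=0.1):
    """Return (RMSE of fake MMM, list of RMSEs of the fake runs)."""
    forced = forced_signal(n_years)
    series = [[f + e for f, e in zip(forced, ar1_noise(n_years, phi, sigma, rng))]
              for _ in range(n_runs + 1)]
    runs, obs = series[:n_runs], series[n_runs]   # last series plays "observations"
    mmm = [sum(col) / n_runs for col in zip(*runs)]
    return rmse(mmm, obs), [rmse(r, obs) for r in runs]

rng = random.Random(0)  # fixed seed so the experiment is repeatable
trials = [one_trial(rng) for _ in range(50)]
mmm_best = sum(1 for m, rs in trials if m < min(rs))
print(f"Fake MMM had the lowest RMSE in {mmm_best} of {len(trials)} trials")
```

Counting how often the fake MMM beats every fake run across repeated trials gives a rough check on the “almost all my runs” observation; the exact fraction will depend on the assumed noise parameters.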
So, IF the residuals of the individual models minus the multi-model mean exhibited properties of an AR(1) process, theoretically this would explain (at least for surface temperatures) why the MMM outperforms the individual runs with respect to RMSE in the hindcast. Of course, this still wouldn’t be a physical explanation of why the deviation from the forced response would manifest itself in the form of red noise. More on that in my next post.
Just for fun, based on the RMSE for 1-year averages and the correlation of DIFF’d values for 10-year averages, here are the “best” two ensemble members and the multi-model-mean compared against the observed values: