Troy's Scratchpad

April 29, 2011

CMIP3 model hindcasts and AR(1) weather noise

Filed under: Uncategorized — troyca @ 6:14 pm

Data and code for this post here.

After looking at the hindcasts of the models in my previous post, I wondered whether the differences between the MMM and the individual model runs of the ensemble might be considered “weather noise”, able to be captured as an AR(1) process.  Assuming the MMM represents the actual forced component, this would explain why it generally seems to perform better relative to actual observations (which could be considered simply another instance of the forced component (MMM) + AR(1) noise) than any individual run. 

Well, after examining the different model runs relative to the MMM, and using the auto.arima (hat tip) and/or ar functions on them in R, it certainly does not seem like these differences are merely the result of AR(1) noise.  Some models generally trend higher while others trend lower than the MMM, which is not surprising given their underlying differences.  But it does raise the question again of why the MMM seems to perform better. 
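To illustrate the check described above (the post's code is in R, using auto.arima/ar; this is a Python sketch of the same idea, on synthetic stand-in data since the CMIP3 series aren't reproduced here): fit an AR(1) to a run-minus-MMM difference series and also estimate its linear trend, which pure AR(1) noise should not have.

```python
# Sketch with synthetic stand-in data: a difference series that drifts
# away from the MMM is not pure AR(1) noise, and the trend test shows it.
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(1900, 2000)

def fit_ar1(x):
    """Lag-1 autocorrelation and innovation sd (Yule-Walker style)."""
    x = x - x.mean()
    phi = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
    resid = x[1:] - phi * x[:-1]
    return phi, resid.std(ddof=1)

# Hypothetical run-minus-MMM difference: a real trend plus white noise.
diff = 0.004 * (years - years[0]) + 0.1 * rng.standard_normal(years.size)

phi, sd = fit_ar1(diff)
slope = np.polyfit(years, diff, 1)[0]
print(f"phi={phi:.2f}, innovation sd={sd:.2f}, trend={slope * 100:.2f} C/century")
```

A high fitted phi alone can be misleading here: the trend itself inflates the lag-1 autocorrelation, which is why checking for a residual trend matters before calling the differences AR(1).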

Nonetheless, the errors between the MMM and the HadCRUT annual anomalies DO seem to act like an AR(1) process.  Using the approximate lag-1 autocorrelation of 0.5 and rnorm standard deviation of 0.1 gleaned from these errors, the following image compares what it would look like with MMM + AR(1) weather noise vs. just the MMM and all the model runs.

In both graphs, there were 54 different yellow “runs”.  The bottom image should basically be a re-creation of the graph in the IPCC report (seen in my last post), although there are some differences.  The IPCC one says it was created from 58 simulations and 14 different models, whereas what I had access to were 54 different ensemble members from 22 different models. 

As you can see, the top image shows tighter bounds, but the observations still seem to stay within them for the 20th century.  To me, going forward, this would seem to be a better way to estimate noise in future predictions than simply showing all models, as it allows us to treat the MMM as the “forced” component without rogue terrible models leading to ridiculously large error bounds.  It also allows us to generate as many model “runs” as we want without needing a model simulation for each.  Of course, the downside of assuming a noise model based on the errors between observations and the MMM is that the parameters for the AR(1) process can be uncertain: simulating a series and then trying to extract the parameters again can show some wildly different results.
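That last caveat is easy to demonstrate (again a Python sketch of what the post does in R): simulate a 100-year AR(1) series with known parameters, re-estimate them, and repeat. The recovered phi scatters quite a bit for series this short.

```python
# Parameter-recovery experiment: how well can we re-estimate phi from a
# single 100-point AR(1) realization?
import numpy as np

rng = np.random.default_rng(1)
true_phi, sd, n = 0.5, 0.1, 100

def simulate_ar1(n, phi, sd, rng):
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal(0, sd)
    return x

def estimate_phi(x):
    x = x - x.mean()
    return np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)

phis = [estimate_phi(simulate_ar1(n, true_phi, sd, rng)) for _ in range(500)]
print(f"estimated phi: {np.mean(phis):.2f} +/- {np.std(phis):.2f}")
```

With n = 100 the spread is on the order of ±0.09, and the estimator is biased slightly low, so any error bounds built from a single fitted phi inherit that uncertainty.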

Lucia has relevant discussion on this here, which got me thinking originally along this path.


April 22, 2011

CMIP3 Multi-Model Mean vs. Individual Runs in 20th Century Hindcast

Filed under: Uncategorized — troyca @ 5:55 pm

It has generally been noted that the multi-model mean in AR4 outperforms the individual members of the ensemble.  In this post, I’ll look at this specifically within the context of the 20th century hindcast of surface temperatures, to see to what degree it holds, and why it might be the case.   After all, the following figure appears twice in the IPCC AR4 report, in chapters 8 and 9:

Like my last post, this one gives me a chance to dive more into the CMIP3 models and work on my time series programming in R.  Once again, I’ll be looking at the 54 ensemble members for SRES A1B that I got from Climate Explorer.  The code and data can all be downloaded here.  The R file itself will read the HadCRUT data over the internet.  I was planning on including the ability to use additional surface temperature datasets for the 20th century, at least GISS, but the format is a bit more difficult to parse.  I may need to borrow one of Kelly O’Day’s scripts to help with that processing in the future.

What is “better”?

Not being a stats guy (and I could be way off here), I wanted to use a couple of simple methods to determine which hindcasts might be considered the “best” or “closest” to actual observations.  I chose two metrics: 1) the RMSE after baselining on the 1900-1950 average, and 2) the correlation between diff(obs) and diff(model_run).  While the two might not be totally independent, getting a high score on #1 does not guarantee a high score on #2, and vice versa.  There’s also a question of what length of time we should be looking at: obviously, anything less than 1 year averages is suspect, but using a 30 year average only leaves us with 3 (independent) observations.
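The two metrics can be sketched like this (a Python stand-in for the R code linked above; the toy obs/run series are hypothetical, built so the run matches the trend but not the year-to-year wiggles):

```python
# Metric 1: RMSE after re-baselining both series to their 1900-1950 means.
# Metric 2: correlation of the year-to-year first differences.
import numpy as np

def rmse_baselined(obs, run, years, base=(1900, 1950)):
    m = (years >= base[0]) & (years < base[1])
    o = obs - obs[m].mean()
    r = run - run[m].mean()
    return np.sqrt(np.mean((o - r) ** 2))

def diff_correlation(obs, run):
    return np.corrcoef(np.diff(obs), np.diff(run))[0, 1]

# Toy check: same underlying trend, independent "weather" wiggles.
years = np.arange(1900, 2000)
rng = np.random.default_rng(3)
obs = 0.006 * (years - 1900) + 0.1 * rng.standard_normal(years.size)
run = 0.006 * (years - 1900) + 0.1 * rng.standard_normal(years.size)
r = rmse_baselined(obs, run, years)
c = diff_correlation(obs, run)
print(f"RMSE={r:.3f}, diff correlation={c:.2f}")
```

Note the asymmetry: a run with the right trend can score well on metric 1 while its diff correlation hovers near zero, which is exactly the "high on #1, low on #2" situation described above.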

Root Mean Square Error

The following histogram shows the RMSE for 1 year average temperatures (basically what we see in the IPCC figure above):

By this metric, the multi-model mean does perform the best.  The “linear trend” value refers to the RMSE of a linear trend line fit to the 20th century temperature, and the “flat line” value refers to the RMSE of a flat line at the 1900-1950 average value.  That the multi-model mean does better than the simple linear trend seems to suggest that it gets some of the up and down fluctuations correct (mostly around volcanic eruptions).  The fact that a simple flat line does not look worse than some of the runs is somewhat surprising, and does not speak too highly of some of those runs.

What happens if we look at 10 year averages (limiting us to only 10 points per run)?

Once again, the multi-model mean performs pretty well compared to the total set of runs, and better than the linear trend, but technically not the “best” (although it is possibly within the error bars based on uncertainty in the HadCRUT dataset).

Correlation of Diffs

What if we try to capture the “forced” fluctuations, minus the trend?  For 1 year, the correlations look something like this:

The multi-model mean performs better than most of the individual runs, although this may be purely by luck, as none of the runs do a particularly good job of guessing year-to-year fluctuations apart from the trend.  Of course, they are not meant to — weather noise at these small intervals is supposed to swamp any forced changes.

Instead, a better metric for the intended use might be the 10-year fluctuations:

In this case, the multi-model mean appears to be in the center of the pack.  What’s more, the R values are not particularly impressive, given the low DF, even for determining the difference between 10 years (minus the overall trend).  More on this later.

Why does the multi-model mean perform better (at least in terms of RMSE) than individual models?

As I mentioned, I’d often heard it was the case that the MMM outperformed individual models, but I wasn’t sure why this should necessarily be the case.  Along the way I stumbled across this post from Science of Doom, where he digs up the following quote from IPCC chapter 8:

Why the multi-model mean field turns out to be closer to the observed than the fields in any of the individual models is the subject of ongoing research; a superficial explanation is that at each location and for each month, the model estimates tend to scatter around the correct value (more or less symmetrically), with no single model consistently closest to the observations. This, however, does not explain why the results should scatter in this way.

In other words, the explanation is not clear.  One thing we might consider is that there is a forced temperature response in the models, but that each of the runs has different “weather noise” getting in the way, and that the multi-model mean — with the averaging cancelling out most of the noise — settles pretty well around the forced response.  The observations themselves, being simply another combination of this forced response + “weather noise”, will not have their noise match up well with the noise from any other run, and so the RMSE will be lower when compared to this multi-model mean than to any individual run.

I simulated this scenario by generating various time series consisting of red noise + sin wave + linear trend.  The first 54 runs I consider the “model”, the 55th I consider the “observations”, and then the average of the first 54 is the “multi-model mean”.  One example of the result is shown below (using the same coloring as the IPCC run):

It looks somewhat similar to the actual hindcast, with the multi-model mean showing considerably less variation than either the observations or the individual model runs.  In almost all my runs, the fake MMM had a lower RMSE than any of the individual fake models compared against the fake observations.  But the R value of the DIFF’d series was scattered randomly around the list, not generally performing better or worse than any of the individual runs.  Both of these properties seem comparable to the actual comparisons of models vs observations vs. the multi-model mean.
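The simulation above can be sketched as follows (in Python rather than the post's R; the trend, sine, and noise parameters here are illustrative choices, not the post's exact values):

```python
# 55 series of red noise + sine wave + linear trend: series 1-54 are the
# fake "models", series 55 the fake "observations", and the average of
# the first 54 is the fake multi-model mean.
import numpy as np

rng = np.random.default_rng(7)
n, n_models = 100, 54
t = np.arange(n)

def fake_run(rng):
    noise = np.zeros(n)
    for i in range(1, n):                      # AR(1) "weather noise"
        noise[i] = 0.5 * noise[i - 1] + rng.normal(0, 0.1)
    return 0.005 * t + 0.1 * np.sin(2 * np.pi * t / 20.0) + noise

models = np.array([fake_run(rng) for _ in range(n_models)])
obs = fake_run(rng)
mmm = models.mean(axis=0)

rmse = lambda a, b: np.sqrt(np.mean((a - b) ** 2))
mmm_rmse = rmse(mmm, obs)
model_rmses = [rmse(m, obs) for m in models]
print(f"MMM RMSE {mmm_rmse:.3f} vs best individual {min(model_rmses):.3f}")
```

Averaging 54 runs shrinks the noise component of the MMM by roughly sqrt(54), so its RMSE against the fake observations is dominated by the observations' own noise, while each individual model carries its own noise on top of that, which is why the fake MMM wins on RMSE so consistently.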

So, IF the residuals of the individual models minus the multi-model mean exhibited properties of an AR(1) process, theoretically this would explain (at least for surface temperatures) why the MMM outperforms the individual runs with respect to RMSE in the hindcast.  Of course, this still wouldn’t be a physical explanation of why the deviation from the forced response would manifest itself in the form of red noise.  More on that in my next post.

Just for fun, based on the RMSE for 1-year averages and the correlation of DIFF’d values for 10-year averages, here are the “best” two ensemble members and the multi-model-mean compared against the observed values:
