For my recent pairwise station tests, I’ve been comparing nearby stations using different “distance” thresholds. I’ve thus been operating under the assumption that nearby stations will record temperatures that match each other better than they match far-off stations. There are probably other studies out there that have done this, but I was curious to see the relationship between the correlation between stations and the distance separating them.

My first step was simply to run an OLS regression between the monthly temperatures at all station pairs within 1000 km to get an R^2 value for each pair, and then plot those R^2 values against the distance between the stations. The chart below shows the results (1):
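As an illustration of the per-pair calculation (this is my own Python sketch, not the actual code linked in the note at the end), assuming NumPy and station coordinates in degrees: for a simple linear regression, R^2 is just the squared Pearson correlation between the two series.

```python
import numpy as np

def pairwise_r2(series_a, series_b):
    """R^2 of an OLS fit between two equal-length monthly temperature
    series.  For simple linear regression this equals the squared
    Pearson correlation coefficient."""
    r = np.corrcoef(series_a, series_b)[0, 1]
    return r * r

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two stations, in kilometres."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))
```

Looping over all pairs, keeping those with `haversine_km(...) <= 1000`, and scattering `pairwise_r2(...)` against distance reproduces the general shape of graph #1.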

One thing that should stand out is the very high R^2 values, even at distances of 1000 km. If this were the entire story, it would seem that there would not be much drop-off at even greater distances.

Obviously, this is not the whole story. We need only look at a chart of average temperatures by month in the US:

Clearly, the month-to-month variance in temperatures is going to dwarf the differences between station temperatures (especially given our offset/intercept), thus giving us an artificially high correlation. This would not be a problem if I instead ran the OLS regression against annual average temperatures. Furthermore, the ranking of station pairs by R^2 should (I think) remain the same regardless of this correlation inflation.
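A quick synthetic demonstration of that inflation (my own sketch, with made-up numbers roughly matching a US seasonal cycle): two “stations” that share only the seasonal cycle, with completely independent local noise, still correlate almost perfectly on monthly data.

```python
import numpy as np

rng = np.random.default_rng(0)
months = np.arange(120)                            # ten years of monthly data
seasonal = 12.0 * np.sin(2 * np.pi * months / 12)  # shared annual cycle, ~24 C swing

# Two "stations" sharing the cycle but with independent local weather noise
station_a = seasonal + rng.normal(0.0, 1.5, months.size)
station_b = seasonal + rng.normal(0.0, 1.5, months.size)

r2 = np.corrcoef(station_a, station_b)[0, 1] ** 2
print(round(r2, 3))  # close to 1, despite the local signals being unrelated
```

Removing the shared cycle before regressing is exactly what the anomaly adjustment below does.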

However, to get a better sense of the “true” R^2 value, we should calculate the anomaly for each month relative to that same calendar month in other years, rather than to some other baseline. So we adjust each January at a station relative to all other Januaries, each February to all other Februaries, etc. This leads to the next graph, which looks more in line with expectations (2):
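The per-month anomaly step might look like the following sketch (the name echoes the *AdjustMonthlyAnomalies* routine mentioned in the note below, but this is an illustrative reimplementation assuming a complete series with no missing months):

```python
import numpy as np

def adjust_monthly_anomalies(temps):
    """Convert a monthly series (Jan, Feb, ..., length a multiple of 12)
    to anomalies relative to that calendar month's own mean: every
    January minus the mean of all Januaries, and so on."""
    temps = np.asarray(temps, dtype=float).reshape(-1, 12)  # years x months
    return (temps - temps.mean(axis=0)).ravel()
```

Running the pairwise regressions on these anomalies instead of the raw monthly values yields graph #2.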

It is worth noting that since the OLS regression determines the best-fit slope between two station temperature series, the graphs above do not require the trends at two stations to be similar in order to achieve a high R^2 value. The graph below attempts to account for this by showing what R^2 would be if we required a coefficient/slope of 1.0 between each pair of temperature series. (In other words, we calculate an offset, take the sum of squares of the differences between each pair of points, divide it by the variance in Y, and subtract the result from 1. It is therefore possible to get a negative value, and this “R^2” is no longer independent of which series is X and which is Y.) (3):
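As a sketch of that fixed-slope statistic (my own Python reconstruction from the description above, not the linked code; I use the total sum of squares of Y as the denominator, which differs from the variance only by the same factor of n in numerator and denominator):

```python
import numpy as np

def r2_fixed_slope(x, y):
    """'R^2' with the slope forced to 1.0: fit only an offset, then
    return 1 - SS(residual) / SS(y deviations).  Can go negative, and
    swapping x and y changes the answer because the denominator
    depends on y only."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    offset = np.mean(y - x)          # least-squares intercept when slope = 1
    resid = y - (x + offset)
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
```

For two series differing only by a constant offset this returns 1.0; for series trending in opposite directions it can easily go below zero.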

One of the reasons I attempted this experiment was to see if there was a natural “cut-off” point to use as a distance threshold. Unfortunately, after looking at the charts above, it is hard to say for certain where that cut-off should be.

Note: The above charts were created using the USHCNv2 TOB AVG dataset as input. Code is available here. To get graph #1, simply comment out the call to *AdjustMonthlyAnomalies* in the *main* function. To get graph #3, set coeffs[1] = 1.0 in the *OLSRegression* function.


Pingback by Testing the PHA with synthetic data – Part 1 « Troy's Scratchpad — January 14, 2011 @ 8:56 pm