Troy's Scratchpad

January 21, 2011

Testing the PHA with Synthetic Data – Part 3

Filed under: Uncategorized — troyca @ 7:00 pm

This is a continuation of Part 1 and Part 2.   Since these datasets are large, I’ve only put up the new synthetic data and the updated Java code here.  Older ones can be retrieved from Part 2.

For a quick summary, Part 1 simply ran the PHA on datasets with an underlying trend and a stochastic component.  Part 2 added in a strong UHI bias, and attempted to see how well the PHA was able to remove this bias.  I believe that post highlighted an area of concern — that the PHA can potentially remove traces of a UHI signal without removing the contamination itself — which had not, to my knowledge, been previously shown with the actual algorithm.

In this (final?) segment, I insert additional inhomogeneities (step and trend) to see how this affects the PHA.  This will go on top of the UHI contamination already present.

Adding Inhomogeneities

There are two separate steps here.

1) Based on the probability of a station change occurring within a year, we determine if a particular month will be the month that a station changes.  If so, a randomly selected step change and trend change are inserted at the month, and subsequent monthly temperatures are adjusted accordingly.

2) Furthermore, we determine if that station will undergo an “instrumentation” change based on the specified probability.  This instrumentation change will insert a cooling bias centered around 1985, to roughly simulate the CRS to MMTS change we see in the USHCNv2 dataset.
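The two steps above can be sketched in Java roughly as follows.  The class name, probability draws, change magnitudes, and 100-year monthly layout are my own illustrative assumptions, not the actual parameters used in the posted code package:

```java
import java.util.Random;

// Illustrative sketch of inserting a step and trend inhomogeneity into a
// monthly temperature series at a randomly chosen change-point.
public class InhomogeneitySketch {

    // Apply a step change (degrees) and a trend change (degrees/month) to
    // all months from changeMonth onward.
    public static void insertChange(double[] monthly, int changeMonth,
                                    double step, double trendPerMonth) {
        for (int m = changeMonth; m < monthly.length; m++) {
            monthly[m] += step + trendPerMonth * (m - changeMonth);
        }
    }

    public static void main(String[] args) {
        Random rng = new Random(0);
        double[] series = new double[1200]; // 100 years of monthly values
        // Hypothetical magnitudes: step from N(0, 0.5) degrees,
        // trend from N(0, 0.01) degrees/month.
        int changeMonth = rng.nextInt(series.length);
        double step = rng.nextGaussian() * 0.5;
        double trend = rng.nextGaussian() * 0.01;
        insertChange(series, changeMonth, step, trend);
    }
}
```

The instrumentation change of step 2) would just be a second call to `insertChange` with a negative step near the month index corresponding to 1985.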

Now, if we perform our 1970-2000 pairwise test on the data set that includes the UHI contamination and additional inhomogeneities, we get the following graph:

We see a significantly worse correlation here than in Part 2, obviously due to the inhomogeneities occurring right in the middle of the period we’re examining.

Running the PHA

After running the PHA on the dataset with UHI contamination and inhomogeneities, here is what we get for the average yearly anomalies:

The trends for the various sets are as follows:

Without UHI or Inhom: 0.191 Degrees/Decade

With UHI: 0.389 Degrees/Decade

With UHI + Inhom: 0.370 Degrees/Decade

PHA Adjusted From UHI + Inhom: 0.350 Degrees/Decade

As you can see, we do get a small cooling bias from the inhomogeneities, and the PHA does adjust the trend downwards.  However, one interesting thing to note here is that the 0.350 Post-PHA trend is slightly larger than the 0.345 Post-PHA trend from Part 2, despite this part including the inhomogeneities with an overall cooling bias (and hence starting from a lower point).  This may be due to UHI contamination bleeding into more stations because of the additional change-points detected.

So, how does it look if we try to search for our original UHI signal after the PHA?

Clearly, as predicted, the additional change-points have smeared this UHI signal even more.  We might expect that with more changepoints (or those of greater effect) this would get diluted further.


As I said in part 2, I think these tests show that we should not expect the PHA to be a miracle worker.  If there is a UHI signal in the RAW or TOB dataset that seems to disappear in the F52 dataset, it does not necessarily imply that the PHA has taken care of it.  It may simply be that the blending of nearby stations has removed the signal, without removing the bias itself.


January 19, 2011

Testing the PHA with synthetic data – Part 2

Filed under: Uncategorized — troyca @ 6:37 pm

This is a continuation of part 1, where I began testing the USHCNv2 PHA using synthetic data.  In that post I simply examined if the synthetic data looked reasonable, as well as if the PHA will automatically adjust to increase the temperature trend.  In this part, I examine a couple of more interesting questions: 1) whether the PHA removes the effect of UHI in these datasets, and 2) whether the resulting output of the PHA (generally the F52 dataset) will hide a UHI signal (if we try to use our method of pairwise comparisons of nearby station trends to find it).

This post re-uses a lot of slightly modified goodies from the past, and the updated package can be found here.

Adding in UHI effects

The first steps of the data generation are the same as before.  However, in my tests here I also added in UHI contamination equal to the “true” underlying trend, thereby creating a dataset whose resulting trend appeared to be double the actual temperature trend.  Here are the steps:

1) I went through each station, and selected a trend* for the log of some economic indicator (or we could call it population) that lasted between 3 and 10 years.  I did this until I got fluctuating trends for the whole 100 year period for that station.  The trend is randomly selected according to a Gaussian distribution with parameters for the mean and standard deviation specified.  Here is a look at the actual 1970-1980 log trend histogram for stations with NHGIS data:

NOTE: I actually found what appeared to be some issues with my 2000 NHGIS variable data by creating a chart like this, which I will discuss in a later post.

2) This economic indicator trend is then multiplied by another random number to specify how this corresponds to the UHI bias in the temperature trend.  The resulting temperature trend is then added to the station monthly temperatures.
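Steps 1) and 2) above can be sketched roughly like this.  The class name, segment lengths, and all distribution parameters here are my own illustrative choices, not the actual values in the posted package:

```java
import java.util.Random;

// Sketch of the UHI contamination steps: a station's log-economic-indicator
// series is built from random piecewise trends lasting 3-10 years, then
// scaled by a random factor into a temperature bias added to each month.
public class UhiSketch {

    // Step 1: build 100 years (1200 months) of cumulative log-indicator
    // values from successive random trend segments.
    public static double[] logIndicator(Random rng, double meanTrendPerYear,
                                        double sdTrendPerYear) {
        double[] log = new double[1200];
        double value = 0.0;
        int month = 0;
        while (month < log.length) {
            int segmentMonths = (3 + rng.nextInt(8)) * 12; // 3 to 10 years
            double trendPerMonth =
                (meanTrendPerYear + rng.nextGaussian() * sdTrendPerYear) / 12.0;
            for (int i = 0; i < segmentMonths && month < log.length; i++, month++) {
                value += trendPerMonth;
                log[month] = value;
            }
        }
        return log;
    }

    // Step 2: convert the indicator into a UHI temperature bias and add it
    // to the station's monthly temperatures.
    public static void addUhiBias(double[] monthlyTemps, double[] logIndicator,
                                  double biasPerLogUnit) {
        for (int m = 0; m < monthlyTemps.length; m++) {
            monthlyTemps[m] += biasPerLogUnit * logIndicator[m];
        }
    }
}
```

In the actual runs, `biasPerLogUnit` would itself be drawn from a random distribution per station, as described in step 2).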

Now, if we perform the same nearby station comparison that I’ve done in my many previous posts, we get a graph that looks like this:

You’ll notice that this looks a lot cleaner than the graphs with actual data.  This is likely a combination of the fact that a) actual UHI contamination is not as bad as simulated here, b) the variables we use as proxies for UHI are not as accurate as I’ve simulated here, and c) there is no missing data or other inhomogeneities added in at this point. 

After running the PHA

Here, I’ve taken the UHI-infected “TOB” dataset and run it through the PHA.  I’ll note that I only ran this on the AVG dataset (since that’s all I’ve created) rather than on separate MAX and MIN datasets, as the official F52 results use.  The resulting dataset and my calculated yearly temperature anomalies can be seen in the code package included with the post.  However, here is a chart plotting the three datasets:

The resulting trends for the three datasets are:

With UHI: 0.389 Degrees/Decade
Without UHI: 0.191 Degrees/Decade
PHA Adjusted From UHI: 0.345 Degrees/Decade

On the plus side, the PHA reduced the effect of UHI contamination by 22%.  I will also note that this is only one particular interpretation of how UHI might contaminate a dataset.  My reading of MW09 suggests that the PHA might operate better if UHI effects occur in short steps and bursts rather than these 3-10 year trends (on the other hand, UHI contamination might operate even more gradually).

Regardless, we see in this scenario that the PHA is NOT a miracle worker, and left a strong majority of the UHI bias within the resulting dataset.  This should not be a surprise: if the underlying trend of the bulk of the stations is increased because of UHI contamination (as we’ve implemented here), that contaminated trend will appear to be the “normal”/unbiased signal.

So what about trying to recover our UHI signal?  If we run everything through the same pairwise test as before, we get the following graph:

Note the change in slope from 11.0 to 3.31.  Essentially, the PHA has reduced the apparent UHI impact by 70%, while only removing about 22% of the actual effect.  Therefore, simply using the resulting F52 dataset in this case for our pairwise comparisons would result in an underestimate of the UHI effect.  It is possible that adding in various other inhomogeneities would increase the number of changepoints, and therefore muddy the UHI waters even further; this is what I’ll be investigating in Part 3.

It is quite probable that my resulting datasets and the effect of UHI do not reflect reality.  However, I believe these tests demonstrate that it is at least possible that the PHA adjustments hinder our ability to find a large portion of the UHI signal in the F52 dataset, WITHOUT actually removing the UHI contamination itself.  This should be kept in mind when trying to locate a UHI signal using these post-PHA-adjusted datasets.

January 14, 2011

Testing the PHA with synthetic data – Part 1

Filed under: Uncategorized — troyca @ 8:56 pm

Well, now that I’m up and running with the USHCNv2 Pairwise Homogeneity Algorithm, I’ve begun my very amateur process of “testing” it using synthetic datasets.  Ruhroh suggested this in the comments on that previous post.  It should be noted that there are several tests described in MW09, likely done in a more professional manner, but I did not see anything that addressed the specific topic of UHI influence and how successful the PHA might be at removing this effect.  Most tests seemed to deal with removing stepwise inhomogeneities.

Generating the Synthetic Datasets

I’m sure that there are better ways to do this, but I’ll describe my basic process for generating the datasets.  For a closer look at the algorithm and everything described in this post, you can get the code/data package here.

1) First, we determine what will be the approximate temperature for each month in 6 major regions in the U.S. (their center points approximately uniformly distributed across the country).  This is done by adding a stochastic component to the specified underlying trend.  The stochastic component is determined by taking a random number from a Gaussian distribution with the standard deviation specified, then adding that to the previous month’s value multiplied by a specified weight (between 0.0 and 1.0).  After this “random” value for each month is determined, we have values that are centered on 0.  We then add in the underlying specified temperature trend.

2) We then generate the approximate temperatures for 60 “subregions”.  This is done by first getting the stochastic component for each month, as in step #1.  We then calculate a weighted average (based on distance from the center of the subregion) of all the 6 major region temperatures for that month, and add that to the “random” value.

3)  We then calculate the monthly temperatures for each station.  I have chosen to simply use the USHCNv2 station list as described in ushcn-v2-stations.txt, which conveniently allows me to re-use that metadata file.  We add a stochastic element to the weighted average of the subregions, but this time it is based on the squared distance from the center of a subregion.

4) Finally, although this likely has no effect with respect to the annual trend or PHA, I add in a base temperature and monthly cycle to make the temperatures appear more realistic.
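The step-1 stochastic component (with the trend added on top) can be sketched as below.  The weight, noise standard deviation, and trend values in any call are illustrative assumptions, and the actual package applies the same idea again at the subregion and station levels with distance-based weighted averages:

```java
import java.util.Random;

// Sketch of step 1: a stochastic monthly component generated as Gaussian
// noise added to a weighted fraction of the previous month's value (an
// AR(1)-style process), with the underlying trend added afterwards.
public class RegionSeriesSketch {

    public static double[] generate(Random rng, int months, double weight,
                                    double noiseSd, double trendPerMonth) {
        double[] series = new double[months];
        double previous = 0.0;
        for (int m = 0; m < months; m++) {
            // persistence from last month plus fresh Gaussian noise
            double stochastic = previous * weight + rng.nextGaussian() * noiseSd;
            series[m] = stochastic + trendPerMonth * m; // add underlying trend
            previous = stochastic;
        }
        return series;
    }
}
```

With `weight` near 1.0 the series wanders slowly; near 0.0 each month is essentially independent noise around the trend line.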

Visual Inspection

Setting my underlying temperature trend to 0.02 degrees per year, you can see what look like (to me) fairly reasonable approximations of that trend in both the 1 year and 5 year anomalies (the seeds are those used for the pseudo-random number generators):

You’ll notice I’ve re-used my CalcAvgTemp Java code for this purpose, although I’ve commented out the adjustments made for UHI proxy variables.  In case you’re curious about the actual numbers corresponding to those graphs, I’ll include them here:

1 Year
Seed 0: .191 per Decade, R-squared .379
Seed 1: .232 per Decade, R-squared .504
Seed 2: .187 per Decade, R-squared .356

5 Year
Seed 0: .194 per Decade, R-squared .715
Seed 1: .214 per Decade, R-squared .846
Seed 2: .193 per Decade, R-squared .713

One other thing I wanted to ensure was that nearby stations correlate better with one another than those far away.  Here is a graph showing these results:

Compare that with the tests I ran with the actual dataset in my last post, and you’ll notice it’s not a perfect match.  However, the actual data has inhomogeneities and other effects involved, and I think this is probably okay for these purposes.

After Running the PHA

You can compare the various adj.PseudoSetX.txt files to their corresponding PseudoSetX.txt files to see the changes made by the PHA.  There were slight “corrections” made each year, but in general there were no gremlins that automatically adjusted the later years higher while lowering the early years.  In fact, I can’t really even graph the adjusted versus the original, because they are completely superimposed, with no difference in the decade trends down to 3 decimal places.  Instead, here’s a graph of the difference between the adjusted and unadjusted sets by year:

There are some spikes that look weird at the beginning and the end, but if you’ll look at the Y-axis/scale you’ll notice that the differences are extremely tiny relative to the anomalies themselves and the actual trends.

Of course, the main source of concern is not necessarily that the PHA automatically increases the trend in a vacuum.  Instead, worries generally involve the idea that the PHA does not properly remove the UHI bias, or — worse still — that the PHA actually bleeds UHI-infected stations into other stations.  With this current set-up, my next parts will involve a closer look into that particular concern.

January 6, 2011

USHCN Station Temperature Similarities by Distance

Filed under: Uncategorized — troyca @ 12:27 am

For my recent pairwise station tests, I’ve been comparing nearby stations with different “distance” thresholds.  I’ve thus been operating under the assumption that nearby stations will record temperatures that match better with each other than with far-off stations.  I’m sure there are probably other studies out there that have done this, but I was curious to see what the relationship was between the correlation between stations and their distance from each other.

My first step was simply to perform an OLS regression between monthly temperatures at all station pairs within 1000 km to get an R^2 value, and then plot those R^2 values against the distance between these stations.  The chart below shows the results (1):

One thing that should stand out is the very high R^2 values, even at 1000 km distances.  If this is the entire story, it would seem that there would not be much dropoff at even greater distances.

Obviously, this is not the whole story.  We need only look at a chart of average temperatures by month in the US:

Clearly, the month to month variance between temperatures is going to dwarf the difference between station temperatures (especially given our offset/intercept), thus giving us an artificially high correlation.  This would not be a problem if I merely ran the OLS regression against the annual average temperature.  Furthermore, the order of ranking stations pairs according to R^2 (I think) should remain the same regardless of the correlation inflation.

However, to get a better sense of the “true” R^2 value, we should calculate the anomaly for each month relative to that specific month in other years, rather than to some other baseline.  So we adjust Jan at a station relative to all other Jans, Feb to all other Febs, etc.  This leads to the next graph, which looks more similar to expectations (2):
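The month-relative anomaly adjustment can be sketched like this; it is a simplified version assuming a complete year-major monthly series, not the actual AdjustMonthlyAnomalies code:

```java
// Sketch of the month-relative anomaly adjustment: each January is expressed
// relative to the mean of all Januaries, each February to all Februaries,
// and so on, removing the seasonal cycle before correlating station pairs.
public class MonthlyAnomalySketch {

    // temps is laid out year-major: temps[year * 12 + month].
    public static double[] toAnomalies(double[] temps) {
        int years = temps.length / 12;
        double[] anomalies = new double[temps.length];
        for (int month = 0; month < 12; month++) {
            double sum = 0.0;
            for (int y = 0; y < years; y++) sum += temps[y * 12 + month];
            double mean = sum / years;
            for (int y = 0; y < years; y++) {
                anomalies[y * 12 + month] = temps[y * 12 + month] - mean;
            }
        }
        return anomalies;
    }
}
```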

It is worth noting that since the OLS regression determines the best fit slope between two station temperature series, the graphs don’t necessarily require the trends between two stations to be similar in order to achieve a high R^2 value.  The graph below attempts to account for this by showing what R^2 would be if we required a coefficient/slope of 1.0 between the pairs of temperature series.  (In other words, we calculate an offset and then take the sum of squares of the difference between each pair of points and divide it by the variance in Y, then subtract it from 1.  It is therefore possible to have a negative value, and the “R^2” value is no longer independent of which is the X vs Y series.) (3):
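That slope-1.0 variant of R^2 might be computed as follows; this is my own reading of the parenthetical description above, not the actual OLSRegression code from the package:

```java
// Sketch of the slope-1 "R^2": fix the regression coefficient at 1.0, fit
// only an offset, and compute one minus the ratio of the residual sum of
// squares to the sum of squares of Y about its mean.  The value can go
// negative and depends on which series is treated as Y.
public class SlopeOneR2Sketch {

    public static double slopeOneR2(double[] x, double[] y) {
        int n = y.length;
        double meanY = 0.0, offset = 0.0;
        for (int i = 0; i < n; i++) {
            meanY += y[i] / n;
            offset += (y[i] - x[i]) / n; // best offset when slope is 1.0
        }
        double ssRes = 0.0, ssY = 0.0;
        for (int i = 0; i < n; i++) {
            double r = y[i] - (x[i] + offset);
            ssRes += r * r;
            ssY += (y[i] - meanY) * (y[i] - meanY);
        }
        return 1.0 - ssRes / ssY;
    }
}
```

When Y tracks X plus a constant, this returns 1.0; when the two series move oppositely, it goes below zero, unlike an ordinary R^2.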


One of the reasons I attempted this experiment was to see if there was a set “cut-off” point to use as a distance threshold.  Unfortunately, after looking at the charts above it is hard to say for certain where this cut-off point should be.

Note: The above charts were created using the USHCNv2 TOB AVG dataset as input.  Code is available here. To get graph #1, simply comment out the call to the AdjustMonthlyAnomalies in the main function.  To get graph #3, set coeffs[1] = 1.0 in the OLSRegression function.

January 3, 2011

More on running the USHCNv2 PHA (Response from Dr. Menne)

Filed under: Uncategorized — troyca @ 12:09 am

As I suggested in the original post documenting my experience, one of my goals was to reproduce the USHCNv2 F52 dataset using the Pairwise Homogeneity Algorithm included in the software folder.  I e-mailed a few of my questions from that post to Dr. Menne, who helpfully answered them.  I thought I’d share these answers in case anybody else is trying to do something similar:

1) The PHA is run separately on Tmax & Tmin series to produce the USHCNv2 dateset.  We then compute Tavg as (Tmax+Tmin)/2.

2) Unfortunately, we never put the .HIS station history files out. These files contain 3 different sources for station history information:

0 – the manual QC done by the original USHCNv1 team (static)
1 – information gleaned from the U.S. Cooperative Observer Summary of the Day dataset (not used)
2 – information extracted from the meta database (MMS) maintained at NCDC (in flux as updates and corrections are added)

We will look into bundling the history files with the rest of the code package sometime in the next year.

3) Yes, the production run uses the parameters supplied by the *.incl

A couple of other notes:

a full reprocess of the USHCN monthly temperature data by the PHA has not been executed since 28 May 2008.  Recent data are simply being appended to the May 2008 output.  In 2011, there will be a new release of the USHCNv2 (F52g) concurrent with the release of GHCN Monthly Version 3 (currently available as a beta release).

as we indicated in the Menne et al. 2009 USHCNv2 overview article, we use all Cooperative Observer monthly temperature series to homogenize the USHCNv2 subset.  This greatly expands the neighbor pool for USHCN sites.  Therefore, you would need to run the PHA on the full Coop temperature database circa May 2008 in order to reproduce the results on our ftp site. We will also look into including this Coop database as part of the code release early next year.

What this tells me is that I may need to put my attempts at reproducing the F52 dataset on the back-burner for the time being, although hopefully in this upcoming year I’ll be able to pick it back up again.

Update (1/9): In addition to the comment below, I see that Ruhroh has queried RomanM regarding the process of separately adjusting max vs. min temperatures.  As he is a statistician and I am not, you may want to read his response here.
