Troy's Scratchpad

January 14, 2011

Testing the PHA with synthetic data – Part 1

Filed under: Uncategorized — troyca @ 8:56 pm

Well, now that I’m up and running with the USHCNv2 Pairwise Homogeneity Algorithm (PHA), I’ve begun my very amateur process of “testing” it using synthetic datasets.  Ruhroh suggested this in the comments on that previous post.  It should be noted that MW09 describes several tests, likely done in a more professional manner, but I did not see anything that addressed the specific topic of UHI influence and how successful the PHA might be at removing this effect.  Most tests seemed to deal with removing stepwise inhomogeneities.

Generating the Synthetic Datasets

I’m sure that there are better ways to do this, but I’ll describe my basic process for generating the datasets.  For a closer look at the algorithm and everything described in this post, you can get the code/data package here.

1) First, we determine the approximate temperature for each month in 6 major regions in the U.S. (their center points approximately uniformly distributed across the country).  This is done by adding a stochastic component to the specified underlying trend.  The stochastic component for a given month is a random number drawn from a Gaussian distribution with the specified standard deviation, plus the previous month’s value multiplied by a specified weight (between 0.0 and 1.0); in other words, a first-order autoregressive process.  These “random” values are centered on 0, and we then add in the specified underlying temperature trend.

2) We then generate the approximate temperatures for 60 “subregions”.  This is done by first getting the stochastic component for each month, as in step #1.  We then calculate a weighted average (based on distance from the center of the subregion) of all the 6 major region temperatures for that month, and add that to the “random” value.

3) We then calculate the monthly temperatures for each station.  I have chosen to simply use the USHCNv2 station list as described in ushcn-v2-stations.txt, which conveniently allows me to re-use that metadata file.  We add a stochastic element to the weighted average of the subregions, but this time the weighting is based on the squared distance from the center of each subregion.

4) Finally, although this likely has no effect with respect to the annual trend or PHA, I add in a base temperature and monthly cycle to make the temperatures appear more realistic.
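Steps 1 through 4 can be sketched in a few lines of Python. This is only a rough illustration, not the code from the package linked above: the function names, the parameter values, the flat (x, y) coordinates, the inverse-distance form of the weights, and the sinusoidal monthly cycle are all assumptions made for the example.

```python
import math
import random


def ar1_noise(n_months, sigma, weight, rng):
    """Zero-centered series: Gaussian draw plus weight * previous month's value."""
    values, prev = [], 0.0
    for _ in range(n_months):
        prev = rng.gauss(0.0, sigma) + weight * prev
        values.append(prev)
    return values


def add_trend(series, trend_per_year):
    """Add a linear trend, given in degrees per year, to a monthly series."""
    return [x + trend_per_year * m / 12.0 for m, x in enumerate(series)]


def distance_weights(point, centers, power=1):
    """Normalized inverse-distance weights (ASSUMED form; power=2 for squared distance)."""
    raw = []
    for cx, cy in centers:
        d = math.hypot(point[0] - cx, point[1] - cy)
        raw.append(1.0 / max(d, 1e-6) ** power)
    total = sum(raw)
    return [w / total for w in raw]


def blend(parents, weights, local_noise):
    """Weighted average of the parent series plus a local stochastic component."""
    return [
        sum(w * p[m] for w, p in zip(weights, parents)) + local_noise[m]
        for m in range(len(local_noise))
    ]


def add_seasonal_cycle(series, base_temp, amplitude):
    """Add a base temperature and a sinusoidal monthly cycle (ASSUMED shape)."""
    return [
        x + base_temp + amplitude * math.sin(2.0 * math.pi * (m % 12) / 12.0)
        for m, x in enumerate(series)
    ]


# Hypothetical parameters and coordinates, for illustration only.
rng = random.Random(0)
n_months = 120
region_centers = [(0.0, 0.0), (10.0, 0.0)]
regions = [add_trend(ar1_noise(n_months, 0.5, 0.5, rng), 0.02) for _ in region_centers]

subregion_center = (3.0, 1.0)
subregion = blend(regions, distance_weights(subregion_center, region_centers),
                  ar1_noise(n_months, 0.3, 0.5, rng))

station = blend([subregion], distance_weights((3.5, 1.2), [subregion_center], power=2),
                ar1_noise(n_months, 0.2, 0.5, rng))
station = add_seasonal_cycle(station, base_temp=12.0, amplitude=10.0)
```

Because each level adds its own noise to a weighted average of the level above, stations that share nearby subregions end up more alike than stations far apart, which is the property the later correlation check looks for.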

Visual Inspection

Setting my underlying temperature trend to 0.02 degrees per year (0.2 per decade), you can see what looks (to me) like fairly reasonable approximations of that trend in both the 1 year and 5 year anomalies (the seeds are those used for the pseudo-random number generators):

You’ll notice I’ve re-used my CalcAvgTemp Java code for this purpose, although I’ve commented out the adjustments made for UHI proxy variables.  In case you’re curious about the actual numbers corresponding to those graphs, I’ll include them here:

1 Year
Seed 0: .191 per Decade, R-squared .379
Seed 1: .232 per Decade, R-squared .504
Seed 2: .187 per Decade, R-squared .356

5 Year
Seed 0: .194 per Decade, R-squared .715
Seed 1: .214 per Decade, R-squared .846
Seed 2: .193 per Decade, R-squared .713
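For anyone wanting to reproduce numbers of this kind, an ordinary least-squares fit of anomaly against year gives both a slope (multiply by 10 for a per-decade figure) and an R-squared. The sketch below is an illustration in Python rather than the CalcAvgTemp code itself; the function names are made up, and treating the “5 year” series as non-overlapping 5-year block averages is an assumption.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of ys on xs; returns (slope, r_squared)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # R-squared is undefined for a perfectly flat series; report 0.0 in that case.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, r_squared


def block_means(years, values, width):
    """Average consecutive, non-overlapping blocks of `width` years (ASSUMED scheme)."""
    out_years, out_values = [], []
    for start in range(0, len(values) - width + 1, width):
        out_years.append(sum(years[start:start + width]) / width)
        out_values.append(sum(values[start:start + width]) / width)
    return out_years, out_values


# Noise-free check: a series rising exactly 0.02 per year.
years = list(range(1900, 2000))
anomalies = [0.02 * (y - 1900) for y in years]
slope, r_squared = linear_fit(years, anomalies)
trend_per_decade = slope * 10.0
```

Averaging over 5-year blocks smooths out the year-to-year noise, which is consistent with the higher R-squared values in the 5 Year list while the trends stay close to 0.2 per decade.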

One other thing I wanted to ensure was that nearby stations correlate better with one another than those far away.  Here is a graph showing these results:

Compare that with the tests I ran with the actual dataset in my last post, and you’ll notice it’s not a perfect match.  However, the actual data has inhomogeneities and other effects involved, and I think this is probably okay for these purposes.
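A check like this can be sketched as follows: compute the great-circle distance and the Pearson correlation for every pair of stations, then average the correlations within distance bins. This is a hedged Python illustration, not the code behind the graph; the `(lat, lon, series)` tuples, the 500 km bin width, and the function names are assumptions.

```python
import math
from collections import defaultdict


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(a)))


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_by_distance(stations, bin_km=500.0):
    """stations: list of (lat, lon, series). Returns {bin_start_km: mean correlation}."""
    sums, counts = defaultdict(float), defaultdict(int)
    for i in range(len(stations)):
        for j in range(i + 1, len(stations)):
            lat1, lon1, s1 = stations[i]
            lat2, lon2, s2 = stations[j]
            bin_start = math.floor(haversine_km(lat1, lon1, lat2, lon2) / bin_km) * bin_km
            sums[bin_start] += pearson(s1, s2)
            counts[bin_start] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}
```

If the synthetic data behaves as intended, the mean correlation should fall off as the bin distance grows.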

After Running the PHA

You can compare the various adj.PseudoSetX.txt files to their corresponding PseudoSetX.txt files to see the changes made by the PHA.  There were slight “corrections” made each year, but in general there were no gremlins that automatically adjusted the later years higher while lowering the early years.  In fact, I can’t really even graph the adjusted versus the original, because they are completely superimposed, with no difference in the decade trends down to 3 decimal places.  Instead, here’s a graph of the difference between the adjusted and unadjusted sets by year:

There are some spikes that look weird at the beginning and the end, but if you look at the Y-axis scale you’ll notice that the differences are extremely tiny relative to the anomalies themselves and the actual trends.
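The by-year difference itself is simple to compute once both sets are loaded. The Python sketch below assumes the annual means have already been read into dictionaries keyed by `(station_id, year)`; that layout, the function name, and the sample values are hypothetical, and the parsing of the PseudoSetX.txt files is deliberately left out because their format is not described here.

```python
from collections import defaultdict


def yearly_mean_difference(unadjusted, adjusted):
    """Mean over stations of (adjusted - unadjusted), per year.

    Both arguments map (station_id, year) -> annual mean temperature.
    Station-years missing from either set are skipped.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for key, raw_value in unadjusted.items():
        if key in adjusted:
            year = key[1]
            sums[year] += adjusted[key] - raw_value
            counts[year] += 1
    return {year: sums[year] / counts[year] for year in sorted(sums)}


# Made-up values, purely to show the shape of the input and output.
raw = {("A", 1900): 10.0, ("B", 1900): 12.0, ("A", 1901): 10.5}
adj = {("A", 1900): 10.5, ("B", 1900): 12.0, ("A", 1901): 10.5}
diff_by_year = yearly_mean_difference(raw, adj)
```

Plotting `diff_by_year` against year on its own axis makes adjustments visible that would disappear if drawn on the same scale as the anomalies.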

Of course, the main source of concern is not necessarily that the PHA automatically increases the trend in a vacuum.  Instead, worries generally involve the idea that the PHA does not properly remove the UHI bias or, worse still, that the PHA actually bleeds UHI-infected stations into other stations.  With this current set-up, my next parts will involve a closer look into that particular concern.
