As my analysis has brought me to this point, I thought I might as well share my experience with running the USHCNv2 Software (specifically the PHA) that produces the F52 output.
Overall, I stuck primarily to the quickstart guide, and the process is documented very well. There were a couple of issues that were exacerbated by my lack of familiarity with Linux, so hopefully this description will help others going down this path.
Getting Started
Since I use a Windows machine (I know, I know) I decided to simply try compiling and running the software in a virtual machine. Both VMware Player and an Ubuntu 10.10 ISO image are free (although you do need to sign up for an account with VMware), so I went with those options.
There were three packages that I needed to [sudo apt-get install] on this Ubuntu machine in order to run the PHA: csh (since only bash is included by default), gawk, and g77 (for Fortran compilation, since I wanted to use the same compiler they recommend).
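For reference, the first two can be pulled straight from the standard repositories (g77 needs the extra steps described below):

# csh and gawk are available in the standard Ubuntu repositories
sudo apt-get update
sudo apt-get install csh gawk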
Getting g77 installed was a little tricky, since apparently it is not included in (or compatible with) the latest version of gcc, and it could not be found among the normal packages in sources.list either. This forum post contains a tip on how to get it, but I’ll reproduce the steps here (use caution, since I understand this reverts your version of gcc as well):
- Ensure you have aptitude (sudo apt-get install aptitude)
- Edit sources.list to include the following references at the bottom: (sudo gedit /etc/apt/sources.list)
deb http://us.archive.ubuntu.com/ubuntu/ hardy universe
deb-src http://us.archive.ubuntu.com/ubuntu/ hardy universe
deb http://us.archive.ubuntu.com/ubuntu/ hardy-updates universe
deb-src http://us.archive.ubuntu.com/ubuntu/ hardy-updates universe
- Update aptitude references and install (sudo aptitude update, then sudo aptitude install g77)
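Once aptitude finishes, a quick check confirms the compiler is actually on the path (and shows which gcc version the hardy packages left you with):

# Confirm the Fortran 77 compiler installed correctly
g77 --version
# And see which (older) gcc you now have as a result of the downgrade
gcc --version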
Preparing Inputs
Now, one of the irritating things about this process is that the INPUT files (temperature and history) are organized by station, rather than combined into one large file with all the stations, like those we get from the USHCN v2 temperature site. Furthermore, the .HIS files seem to be in a very different format from the one station.history file I found here, and I could not find where the .HIS files used to generate the F52 dataset might be located.
Nor could I find any quick way to generate these HIS and individual station temperature files. So I wrote a Java app that takes as arguments a USHCNv2 temperature file (such as TOB), the station history file, and the station meta-file (e.g. ushcn-v2-stations.txt), and outputs the collection of HIS and temperature files per station, along with a station meta-file that has all stations without temperature data removed. Right now I only export the temperature files with the suffix “_avg.raw”, although this can easily be changed to min/max and tob if desired.
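I won’t paste the whole Java app here, but the core of the temperature split can be sketched in a few lines of gawk. This assumes the station ID occupies the first six characters of each fixed-width record in the v2 temperature file (worth double-checking against the USHCN v2 readme), and the input file name is just a placeholder for whichever TOB file you downloaded:

# Split a combined USHCN v2 temperature file into one file per station.
# Assumes columns 1-6 of each record hold the station ID (check the readme).
gawk '{
    id = substr($0, 1, 6)        # station ID from the fixed-width record
    out = id "_avg.raw"          # same suffix my Java app writes
    print $0 >> out              # append this record to the station file
    close(out)                   # keep the number of open files small
}' ushcn_tob.avg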
I’m still not sure whether these HIS files are in the correct format. There seems to be some discrepancy between the only sample HIS file provided (“011084.his”) and the description of the HIS format (which, according to “USHCN.v2.0.20090806.doc”, is “SHF_tob_inst.txt”). Furthermore, the station.history file does not seem to include all the information that can possibly be entered in a HIS file, so my Java app leaves some fields blank.
Compiling and Running
From here, most things ran pretty well following the quickstart guide. However, there were three script files that seemed to have minor bugs:
gen_corr.sh, gen_homog.sh, and gen_fillin.sh
All three of these have the same logic for processing the input type argument:
if [ $elem == "max" ]
then
  export iel=1
  export gel="tmax"
  echo "MAX"
elif [ $elem == "min" ]
then
  export iel=2
  export gel="tmin"
  echo "MIN"
elif [ $elem == "pcp" ]
then
  export iel=4
  export gel="prcp"
  echo "PCP"
else
  echo "Unknown element: $elem"
  exit
fi
The shell does not accept the C/Java-style "==" inside the [ ] test and complains about the argument; removing one of the "=" characters does the trick. Furthermore, the block is missing an entry for the "avg" choice, so you’ll need to add something like:
elif [ $elem = "avg" ]
then
  export iel=3
  export gel="tavg"
  echo "AVG"
Finally, I made the mistake of originally allocating only 512 MB of RAM to the VM. This meant that any attempt to run the PHA on a "real-world" scenario was swiftly met by the OOM killer. At first I tried fiddling with the .incl files to lower the array sizes, which led to a whole host of errors. In the end, simply increasing the RAM for the VM got it to run OK, although it took a couple of hours.
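If a run dies without an obvious Fortran error, the kernel log is the quickest way to confirm the OOM killer is to blame (this is generic Linux diagnostics, nothing PHA-specific):

# Check whether the kernel killed the PHA for running out of memory
dmesg | grep -i "out of memory"
# And see how much memory the VM actually has to work with
free -m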
Results
My original goal was simply to see whether I could go from the TOB AVG dataset to the F52 AVG dataset. If you want to see the result of this test run, it is included in the package here. A quick diff, however, revealed a number of differences. This was further confirmed when I plotted the results for the first station, comparing the USHCNv2 TOB.avg file, the USHCNv2 F52.avg file, and my test run using the USHCNv2 TOB.avg file as input:
[Chart: first-station comparison of USHCNv2 TOB.avg, USHCNv2 F52.avg, and the test-run F52]
As you can see, there is quite a difference. I’ll need to read up and run a few more tests to figure out why this is happening, but there are quite a few possibilities:
- I tried to go from TOB->AVG, but I passed "raw" as the parameter to the scripts. My initial thought was that this simply tells the scripts which file extension to look for, but perhaps extra processing is done as well.
- I don’t have the actual station HIS files for input.
- I don’t know exactly what values were set for various INCL parameters, such as “neighborhood” stations and the like.
- Perhaps the F52 is formed by averaging F52_max and F52_min, which would mean running the algorithm on MAX and MIN separately?
As I said, this takes several hours to run on my machine, so it’s not easy to quickly test where I might have gone wrong.
Not all is lost, however. For the period 1970-2000, my average anomaly calculations for the U.S. (using a 2.5×3.5 grid) look almost exactly the same whether I use the F52 dataset downloaded from USHCNv2 or the F52 created by my run of the PHA (r-squared = 0.9998). Furthermore, both match up rather well with the GISS anomalies over that period:
[Chart: U.S. average anomalies, 1970-2000: GISS, USHCNv2 F52, and my PHA-generated F52]
The GISS slope is 0.288 C/decade, the slope of my average calculated from the USHCNv2 F52 is 0.298 C/decade, and the slope of my average calculated from my generated F52 is 0.292 C/decade. Compare those to the 0.257 C/decade I get from the input TOB dataset, and it is clear that the PHA run did indeed bring it more into line with the F52 we get from the USHCN website.
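For anyone who wants to check trend numbers like these, they are presumably just simple linear fits to the annual anomaly series; here is a minimal least-squares sketch in gawk, assuming a two-column input of year and anomaly (the file name anomalies.txt is hypothetical):

# Ordinary least-squares slope of anomaly vs. year, scaled to C per decade
gawk '{ n++; sx += $1; sy += $2; sxx += $1*$1; sxy += $1*$2 }
     END { slope = (n*sxy - sx*sy) / (n*sxx - sx*sx); print "C/decade:", 10*slope }' anomalies.txt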
For my purposes this may be enough, and I may proceed with my analysis a bit more before coming back to some of these unanswered questions.
Update 1/3: Many of these lingering questions have been answered in a subsequent post here. To summarize, I won’t be able to reproduce the F52 dataset exactly, because the HIS files are not currently available, and because the official PHA run also uses a COOP temperature database that has yet to be released.