As I suggested in the original post documenting my experience, one of my goals was to reproduce the USHCNv2 F52 dataset using the Pairwise Homogeneity Algorithm included in the software folder. I e-mailed a few of my questions from that post to Dr. Menne, who helpfully answered them. I thought I’d share these answers in case anybody else is trying to do something similar:
1) The PHA is run separately on Tmax & Tmin series to produce the USHCNv2 dataset. We then compute Tavg as (Tmax+Tmin)/2.
2) Unfortunately, we never put the .HIS station history files out. These files contain 3 different sources for station history information:
0 – the manual QC done by the original USHCNv1 team (static)
1 – information gleaned from the U.S. Cooperative Observer Summary of the Day dataset (not used)
2 – information extracted from the meta database (MMS) maintained at NCDC (in flux as updates and corrections are added)
We will look into bundling the history files with the rest of the code package sometime in the next year.
3) Yes, the production run uses the parameters supplied by the *.incl files.
A couple of other notes:
a full reprocess of the USHCN monthly temperature data by the PHA has not been executed since 28 May 2008. Recent data are simply being appended to the output of the May 2008 run. In 2011, there will be a new release of the USHCNv2 (F52g) concurrent with the release of GHCN Monthly Version 3 (currently available as a beta release).
as we indicated in the Menne et al. 2009 USHCNv2 overview article, we use all Cooperative Observer monthly temperature series to homogenize the USHCNv2 subset. This greatly expands the neighbor pool for USHCN sites. Therefore, you would need to run the PHA on the full Coop temperature database circa May 2008 in order to reproduce the results on our ftp site. We will also look into including this Coop database as part of the code release early next year.
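To make point 1 concrete, here is a minimal Python sketch of the averaging step that follows the two separate PHA runs. This is only my illustration, not the NCDC code: the function name `compute_tavg`, the sample values, and the choice to represent a missing month as `None` are all assumptions on my part.

```python
from typing import List, Optional, Sequence


def compute_tavg(
    tmax: Sequence[Optional[float]],
    tmin: Sequence[Optional[float]],
) -> List[Optional[float]]:
    """Return Tavg = (Tmax + Tmin) / 2 for each month.

    Hypothetical helper: a month is left missing (None) in the result
    if either the Tmax or the Tmin value is missing for that month.
    """
    if len(tmax) != len(tmin):
        raise ValueError("Tmax and Tmin series must have the same length")
    return [
        None if hi is None or lo is None else (hi + lo) / 2
        for hi, lo in zip(tmax, tmin)
    ]


# Made-up monthly values in degrees C, purely for illustration.
adjusted_tmax = [10.0, None, 20.0]
adjusted_tmin = [0.0, 5.0, 10.0]
print(compute_tavg(adjusted_tmax, adjusted_tmin))
```

The point is simply that Tavg is never homogenized directly; it is derived after the fact from the two adjusted series.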
What this tells me is that I may need to put my attempts at reproducing the F52 dataset on the back burner for the time being, although hopefully I’ll be able to pick it back up again in the upcoming year.
Update (1/9): In addition to the comment below, I see that Ruhroh has queried RomanM regarding the process of separately adjusting max vs. min temperatures. As he is a statistician and I am not, you may want to read his response here.
Troyca;
Point #1 seems to be amazing to me.
My impression of the point of the adjustments was to automatically detect station moves.
What does it say about the station moves when the Tmin is adjusted separately from Tmax ?
I just can’t imagine that the daytime Tmax thermometer is not moved simultaneously with the nighttime Tmin thermometer.
Am I missing something here?
This seems like a really big deal to me…
Thanks for the ongoing diligence!
RR
Comment by Ruhroh — January 4, 2011 @ 9:28 am
Hi Ruhroh,
From my perspective, I don’t see anything in that process that immediately raises a red flag. In fact, I suspected this might be how it was run, in point #4 of my original post on the topic:
“Perhaps the F52 is formed by averaging F52_max and F52_min, which would mean running the algorithm on MAX and MIN separately?”
Since they put out the MIN and MAX along with the AVG, and all the other datasets (raw and TOB) have AVG merely calculated by using (MIN + MAX) / 2, it might have seemed odd if the F52 did not have this same property.
Furthermore, my understanding is that with certain instrument changes (or UHI effects), Tmin may be affected but Tmax not as much, or vice versa. Thus, if they tried to detect inhomogeneities using only the AVG dataset, the magnitude of these effects would be divided by 2 and thus harder to find. Those are the best reasons I can think of for doing it this way, though I am not a statistician and could not definitively comment on whether it is “right” or “wrong”…you may need to recruit others to answer that.
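As a rough illustration of the “divided by 2” point, here is a toy Python sketch. It is not the PHA and involves no neighbor comparisons; the series are made up, noise-free, and the names `step_size` and `demo_step` are my own. It only shows that a shift confined to Tmin shows up at half its size in Tavg.

```python
def step_size(series, break_index):
    """Mean after the break minus mean before the break."""
    before = series[:break_index]
    after = series[break_index:]
    return sum(after) / len(after) - sum(before) / len(before)


def demo_step(shift=1.0, n=10, break_index=5):
    """Build toy series where only Tmin shifts, then measure the step.

    Assumed setup: constant Tmax, and Tmin that jumps by `shift`
    at `break_index` (e.g. an instrument change affecting only Tmin).
    """
    tmax = [25.0] * n
    tmin = [10.0] * break_index + [10.0 + shift] * (n - break_index)
    tavg = [(hi + lo) / 2 for hi, lo in zip(tmax, tmin)]
    return step_size(tmin, break_index), step_size(tavg, break_index)


tmin_step, tavg_step = demo_step()
print(tmin_step, tavg_step)  # 1.0 0.5
```

With real data the noise in Tavg would not necessarily shrink by the same factor, so a half-sized step could plausibly be harder to detect, though how much harder is something I can’t say from this toy example.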
You do raise an interesting point, however, and perhaps when the HIS and additional COOP stations are released this year we can test what kind of difference it makes to simply run the PHA on TAVG vs. running it on TMIN and TMAX separately. I suspect there will be little difference, but if there IS a substantial discrepancy, your question may become very important. Or perhaps the PHA can be improved if the TMIN and TMAX runs had some knowledge of each other, and could reduce the number of “false positives”.
My greater concern is the fact that since the exact HIS files have not yet been released, it seems to me that there has been no external replication of the F52 dataset that is used in the GISS calculations. I might worry about this more but for the fact that the PHA even without these files and the COOP stations seems to reproduce the trend fairly well, and because they will be released in this upcoming year anyhow.
Comment by troyca — January 4, 2011 @ 4:11 pm