MET Practical Sessions, January 2012

Plotting with R

In this section, you will use the R statistics software package to produce a plot of a few results. R is open source software available for free from www.r-project.org. It has already been installed on the classroom machines for you. The linux version operates via a command line.

In your results file tutorial/out/point_stat/point_stat_run2_360000L_20070331_120000V_cts.txt, you have closest point (INTERP_PNTS = 1) verification for 2-meter temperature (FCST_LEV = Z2), above freezing (FCST_THRESH = >278) over two spatial areas - one for the eastern United States (VX_MASK = EAST) and another for the western United States (VX_MASK = WEST). These are lines 11 and 19 in this file. We will plot a couple of statistics for each region along with their associated confidence intervals.

To begin, you have to start up the R software by simply typing R. Execute the following commands on the command line, and be sure to read the inline comments:

    R

    #Now you will have the R prompt, so you are still issuing commands at a prompt, but this time inside of R.

    # Open a graphics device window

    X11()

    # Read in your data file

    mydata <- read.table("tutorial/out/point_stat/point_stat_run2_360000L_20070331_120000V_cts.txt", header=T)

    #Look at the column names of your new data table mydata

    names(mydata)

    # Pick out the two lines containing data to be plotted
    east_ind <- mydata$OBTYPE=="ADPSFC" &
    mydata$INTERP_MTHD=="UW_MEAN" &
    mydata$INTERP_PNTS ==1 &
    mydata$VX_MASK=="EAST" &
    mydata$FCST_VAR=="TMP" &
    mydata$FCST_LEV=="Z2" &
    mydata$FCST_THRESH==">278.000"

    west_ind <- mydata$OBTYPE=="ADPSFC" &
    mydata$INTERP_MTHD=="UW_MEAN" &
    mydata$INTERP_PNTS ==1 &
    mydata$VX_MASK=="WEST" &
    mydata$FCST_VAR=="TMP" &
    mydata$FCST_LEV=="Z2" &
    mydata$FCST_THRESH==">278.000"

    #Plot POD vs POFD (sometimes called receiver operating characteristic when many thresholds are used)
    # for freezing / non-freezing event in 2 domains: eastern US (EAST) and western US (WEST)

    plot(mydata$POFD[east_ind], mydata$PODY[east_ind], xlab="Probability of False Detection", ylab="Probability of Detection", xlim=c(0,1), ylim=c(0,1))
    points(mydata$POFD[west_ind], mydata$PODY[west_ind], pch=16)

    #Add no skill line to plot and label

    abline(0,1,lty=3)
    text(.52, .45, "No skill")

    #Now add the confidence bounds

    lines(c(mydata$POFD_NCL[east_ind], mydata$POFD_NCU[east_ind]), c(mydata$PODY[east_ind], mydata$PODY[east_ind]))
    lines(c(mydata$POFD[east_ind], mydata$POFD[east_ind]), c(mydata$PODY_NCL[east_ind], mydata$PODY_NCU[east_ind]))
    lines(c(mydata$POFD_NCL[west_ind], mydata$POFD_NCU[west_ind]), c(mydata$PODY[west_ind], mydata$PODY[west_ind]))
    lines(c(mydata$POFD[west_ind], mydata$POFD[west_ind]), c(mydata$PODY_NCL[west_ind], mydata$PODY_NCU[west_ind]))

    legend(.6, .3, legend=c("EAST", "WEST"), pch=c(1,16), bty="n")


    #Now do the whole thing again, but this time write the plot to a file by specifying a different (jpeg) device first

    jpeg(filename="tutorial/out/plot1.jpeg")

    plot(mydata$POFD[east_ind], mydata$PODY[east_ind], xlab="Probability of False Detection", ylab="Probability of Detection", xlim=c(0,1), ylim=c(0,1))
    points(mydata$POFD[west_ind], mydata$PODY[west_ind], pch=16)

    #Add no skill line to plot and label

    abline(0,1,lty=3)
    text(.52, .45, "No skill")

    #Now add the confidence bounds

    lines(c(mydata$POFD_NCL[east_ind], mydata$POFD_NCU[east_ind]), c(mydata$PODY[east_ind], mydata$PODY[east_ind]))
    lines(c(mydata$POFD[east_ind], mydata$POFD[east_ind]), c(mydata$PODY_NCL[east_ind], mydata$PODY_NCU[east_ind]))
    lines(c(mydata$POFD_NCL[west_ind], mydata$POFD_NCU[west_ind]), c(mydata$PODY[west_ind], mydata$PODY[west_ind]))
    lines(c(mydata$POFD[west_ind], mydata$POFD[west_ind]), c(mydata$PODY_NCL[west_ind], mydata$PODY_NCU[west_ind]))

    legend(.6, .3, legend=c("EAST", "WEST"), pch=c(1,16), bty="n")

    #Close jpeg file

    dev.off()

    # Exit R
    quit()

Now look at the scatter plot you have produced:

    xv tutorial/out/plot1.jpeg&

It only has 2 points, and both have cross hatches showing the confidence bounds. The two variables displayed are probability of detection (POD) and probability of false detection (POFD). The POD gives the probability of detecting an event. In this case, our event is temperatures above freezing. POFD gives the probability of detecting a non-event (in this case, freezing temperatures).

This plot is conducive to a comparison of the forecast in the two regions. The POD for these two regions is very similar, and the confidence bounds overlap. Thus, they are essentially the same. However, the POFD intervals for the two regions do not overlap. Thus, one may conclude that the model does a better job of identifying areas with freezing temperatures (i.e. non-events) in EAST than in WEST.

You may notice that the confidence bounds for POD do not appear to be symmetric about the point. This is because probabilities are bounded between (0,1). A POD value near 0.5 will have a symmetric confidence interval while probabilities nearer 0 or 1 will have asymmetric confidence intervals.

If this plot had points for many temperature thresholds connected by a line, for example (263, 273, 283, 293, 303), then this plot would be called a receiver operating characteristic (ROC) curve. This a simple extension of this exercise.