
R environments and argument passing

Publication date: Jul 23, 2009 6:38:44 PM

(or, My God!  It's full of LISP!)

This article presupposes a reasonable knowledge of the R statistical programming environment. I highly recommend it, but you're not going to learn the basics here! I will, however, attempt to shed some light on a few technical issues you might run into when scripting with formula objects. This would most likely be for model fitting and plotting using the excellent lattice package, but I'd be happy to hear about other applications. The only reason I'm writing about this is that I have not seen any coherent explanation of this stuff anywhere. If you know of a better write-up somewhere, please let me know!

So, I've finally been driven to write something on this sciencey-blog thing I've been planning, because I just finished something that is so derned complicated that if I don't write it down now, I'm afraid I'll forget it. First off, here's a short list of functions that can melt your brain while writing R code:

environment
eval / substitute
parse / deparse
function

I know what you're thinking: "function! I use function all the time!" And of course I do too.  In particular, I write lots of functions for exploratory data analysis.  Let's say I have lots of data frames that I can load from different files, and I have some data pre-processing I'd like to do.  So, I write a little pre-process and plot function like this:

library(lattice)

breakme <- function(df, fo=y ~ x) {
    # I know that first row is garbage
    modified.df <- df[-1,]
    # I also want to compute some things here to label my data
    modified.df$big <- modified.df$y > 1

    xyplot(fo, modified.df, groups=modified.df$big, auto.key=TRUE)
}

Now, when I pass in a data frame, I get a nice little plot.  For those of you copy-pasting at home:

my.df <- data.frame(y=c(100, 0.2, 0.3, 1.4, 1.3), x=c(1000, 1, 2, 1, 2))
breakme(my.df)

This makes a nice little plot, coloring the larger and smaller y-values differently and avoiding that nasty big first point. But, let's say I wanted to swap my axes, like so:

breakme(my.df, x ~ y)

R says, "Error in eval(expr, envir, enclos) : object 'modified.df' not found," and I spend many hours wrapping my head around the problem. 

Understanding the problem

The first step to unravelling the mystery is actually figuring out what the heck kind of magic xyplot is doing. I'd used the seemingly magical ability to use variable names in the groups argument for lattice plot functions for years, but never really understood what was going on. Now, it was time. R has a nice feature where you can see the code underlying almost any function by printing it (often by simply typing the function name at the R prompt).  But when I type xyplot I get the relatively uninformative:

function (x, data, ...) 
UseMethod("xyplot")
<environment: namespace:lattice>

No problem, right? I know that x is a formula in my case, so I just type xyplot.formula. To this R says, "Error: object 'xyplot.formula' not found." Fortunately, the internet eventually told me about non-visible methods: if you type methods(xyplot), you'll see an asterisk next to xyplot.formula, and you can view a non-visible method with something like getAnywhere(xyplot.formula).
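
The incantations, for anyone following along, look roughly like this (the exact listing will depend on your lattice version):

methods(xyplot)
# [1] xyplot.formula* xyplot.ts*
# (non-visible methods are the ones marked with an asterisk)

getAnywhere(xyplot.formula)   # prints the full, otherwise-hidden definition

Notably, right there at the top of that definition, you see the line: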

groups <- eval(substitute(groups), data, environment(x))

Oh no!  It's all the brain-melting R functions together in one line!  What's going on here is that R function arguments are not evaluated when the function is called; instead, each argument is passed as a "promise." So, before the argument is used as a value, you can grab the symbolic expression behind it, convert it to text, or even reevaluate it in another context. It's like reaching back up the call stack with your zombie programmer hand to eat the brains of the code that called you.
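
Here's a tiny, self-contained illustration of that zombie-hand trick (show.arg is just a throwaway function I'm making up for the occasion):

show.arg <- function(x) {
    ex <- substitute(x)        # grab the unevaluated expression behind the promise
    cat("you passed:", deparse(ex), "\n")
    eval(ex, list(a=10))       # reevaluate it in a context where a is 10
}

show.arg(a + 1)
# you passed: a + 1
# [1] 11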

In this case, xyplot is reevaluating the expression you pass as groups, first in the context of the data frame in data and then in the environment of the formula in x.  When I use the default formula in breakme above, that formula is created inside the function, so it carries the function's own evaluation environment, where modified.df exists, and everything is fine.  But if I pass in a formula from outside, it carries the environment I called breakme from, so all of a sudden groups gets evaluated in a place where modified.df was never created.
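
You can poke at this directly, since environment() works on formulas. A quick sanity check at the prompt (make.fo is just a toy):

environment(y ~ x)       # a formula typed at the prompt carries the global environment
# <environment: R_GlobalEnv>

make.fo <- function() y ~ x
environment(make.fo())   # a formula created inside a function carries that call's frame
# <environment: 0x...>   (an anonymous evaluation environment)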

Solutions

In this simple case, the easiest thing would just be to change the groups argument to the bare symbol big, which xyplot can then find as a column of the data frame.  But in my actual application, I was passing groups into breakme as well.  There are all kinds of fancy ways you can try to extract the appropriate information from an argument to breakme, but in the end none of them will help, because xyplot always evaluates groups in the formula's environment, and for a formula passed in from outside that means one level up the call stack.  The simplest solution (IMHO) is to change the environment of fo.
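
Here's that fix spelled out for the toy example first (breakme.fixed is just my name for the patched version, nothing official):

breakme.fixed <- function(df, fo=y ~ x) {
    modified.df <- df[-1,]
    modified.df$big <- modified.df$y > 1

    # Re-home the formula in this function's evaluation frame, so that
    # xyplot's eval(substitute(groups), data, environment(x)) can see modified.df
    environment(fo) <- environment()
    xyplot(fo, modified.df, groups=modified.df$big, auto.key=TRUE)
}

breakme.fixed(my.df, x ~ y)  # no more "object 'modified.df' not found"

In the end I wrote the following actually useful code: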

# A helper function kinda like xtabs w/ tapply-like functionality
tapply.formula <- function(fo, df, func=mean) {
    mf <- model.frame(fo, df)
    i <- attr(attr(mf, 'terms'), 'response')
    y <- mf[[i]]
    y.name <- colnames(mf)[i]
    by <- mf[-i]

    # This messes up our responseName, if it's not a valid symbol
    return(as.data.frame.table(tapply(y, by, func, na.rm=TRUE), 
                               responseName=y.name))
}

collapse.and.plot <- function(fo=rt ~ hand, df, groups=NULL, 
                              pdf.name=NULL, plotfunc=barchart, func=mean, ...) {
    # Create a more complete formula for tapply.formula
    fo2 <- fo
    if(mode(substitute(groups)) == 'name') {
        # This is a little gratuitous, but keeps parity with the lattice API
        groups.name <- deparse(substitute(groups))
        fo2 <- update(fo2, paste('. ~ . + ', groups.name))
    }
        
    reduced.df <- tapply.formula(fo2, df, func)

    pars <- list(fontsize=list(text=24))

    # This code is unlikely to work if you pass in anything else!
    if(mode(substitute(groups)) == 'name') {
        # Almost equivalent to eval(substitute(groups), reduced.df)
        groups <- reduced.df[,groups.name]
        gray.levels <- gray(seq(0.2, 0.8, along.with=levels(groups)))
        pars$superpose.polygon <- list(col=gray.levels)
    }

    # Make our plot function look in the right place and fix variable names
    environment(fo) <- environment()
    # Some crazy stuff because of the way names get mangled by
    # as.data.frame.table in tapply.formula
    fo[-1] <- parse(text=make.names(fo[-1]))

    # Now do the actual plotting stuff
    if(!is.null(pdf.name))
        pdf(pdf.name)

    trellis.par.set(pars)
    print(plotfunc(fo, reduced.df, groups=groups, 
             # ylab='RT (seconds)', 
             auto.key=TRUE, # or e.g. list(text=c('simple', 'sequential'))
            ...))
             # space='inside')) - don't like how this looks!

    if(!is.null(pdf.name))
        dev.off()

    return(reduced.df)
}
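
For reference, here's the kind of call I make with it. The data frame and the column names (rt, hand, cond) are placeholders standing in for my real data:

fake.df <- data.frame(rt=runif(40, 0.4, 0.9),
                      hand=rep(c('left', 'right'), 20),
                      cond=rep(c('simple', 'sequential'), each=20))

# The collapsed table by itself
tapply.formula(rt ~ hand + cond, fake.df)

# Collapse rt over hand and cond, draw a grouped barchart, and get the
# reduced table back for further poking
reduced.df <- collapse.and.plot(rt ~ hand, fake.df, groups=cond)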