Lab Notebook: Trials & Errors

So, in trying out the R package “inlabru” and then installing all sorts of experimental updates/upgrades to debug it, I think I broke the dependencies (or something) for my R-INLA install on the HPC system. Now I can’t run R-INLA models without getting some sort of error. I’m waiting to see whether we can upgrade R on the cluster to the latest stable version, which might help with some of this.

I also tried splitting “route” and “observer” back out and modeling them separately.

A Primer on Crashing INLA Models

This post was originally written 09/28/2015 and copied over from my old blog.

There seems to be at least something to the fact that the models get harder to fit with bigger data. Our theory is that INLA crashes when the model isn’t a good fit. Luckily, judging from a search for “crash,” I’m far from the only one who has had this problem. I’m not sure yet if this is an answer, but, as so often in the GLMM and R world, it has something to do with an optimizer. I tried the “new” (as of April 2015) option control.inla = list(correct = TRUE, correct.factor = 10), which the authors say may become standard in a later version.
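
For my own notes, here is how that option plugs into a call. This is a minimal sketch: “form” and “dat” are placeholders standing in for my real formula and data, and the Poisson family is just an example.

library(INLA)
result = inla(form, data = dat, family = "poisson",
              control.inla = list(correct = TRUE, correct.factor = 10))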

The thread I mentioned before even contains the same perplexing output I see when I run the models “verbose.” That poster was in a similar vein of thinking as some of my prior thoughts: he noticed the models ran fine when the data set was smaller, but then the bigger set crashed. This is where I translated (maybe not correctly) the comment that follows, suggesting “adding the option control.inla(diagonal=100) …also try values other than 100, such as 10 and 10000,” into what they now have as a new feature (this post, after all, is from 2011). The author further qualifies that this option adjusts the Hessian optimization to make it more robust.

Further down this thread, there is a suggestion for giving INLA better starting values. That author builds on the prior suggestion by “complicating” the control.inla call like so…

 control.inla = list(diagonal = 100, strategy = "gaussian", int.strategy = "eb")  

This is the “cheapest” (his words) approximation, but it will do to at least point INLA in the right direction in the “real” call.

The values you can pass via control.inla = list(strategy = "...") in your call to INLA are…

  • laplace – the “full Laplace,” and the most computationally intense; should be used if you need, e.g., higher accuracy in the tails
  • simplified.laplace – the default
  • gaussian – the least intense
    • can be used if you’re having problems with models crashing
    • the “old way”
    • the simplest approximation
    • “cheap”
Running Gaussian can significantly influence the results, so this author uses that approximation only as a starting point. He saves that run as something like “starting.value,” and then adds…
 control.mode = list(result = starting.value, restart = TRUE)   

…to the normal INLA call. He further suggests “cooling” the diagonal if 100 is too small: start with, e.g., 1000, then go down to 100, then down to 10. In the meantime, I also found a few workarounds to make the INLA crashing process less painful on Windows.
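
Putting the pieces together, the whole two-step recipe looks roughly like this. This is my own sketch, not the thread author’s exact code: “form” and “dat” are placeholders, and the family is just an example.

# step 1: cheap Gaussian/EB run to generate starting values
starting.value = inla(form, data = dat, family = "poisson",
                      control.inla = list(diagonal = 1000, strategy = "gaussian",
                                          int.strategy = "eb"))

# step 2: restart the full fit from those values, "cooling" the diagonal
result = inla(form, data = dat, family = "poisson",
              control.inla = list(diagonal = 100),
              control.mode = list(result = starting.value, restart = TRUE))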

Debugging my Script to Run on HPC

I have access to a supercomputer (and maybe more upcoming) and I’d like to try making my workflow parallel. Back when I used HTCondor, I broke my script down to hypothetically do this, but I ran into some snafus and didn’t quite end up implementing it on their system. Anyway, I’m ready to try again in a different environment. (File this one under lab notebook edition.) I’ve tweaked a number of things I’m trying to keep track of, lest my submission script break the workflow.

  • remember your model file must end with a newline character in order to correctly count the number of lines
  • do I need to set the MKL_NUM_THREADS variable?
    • apparently only if the script isn’t running parallel code on its own; otherwise, this is where setting it to $NSLOTS would potentially overload the node (see the sketch after this list)
    • “If INLA is going to use OpenMP then I would suggest letting it have 36 cores and then setting MKL to 2.  This would put you at running (at most) 72 threads.  Currently our processors have hyperthreading turned on which allows them to operate ~2 threads per core, so running 72 cores worth of work shouldn’t be an issue and will likely give you a small boost to your speed.  Of course if you spend very little time in INLA code then it may make more sense to keep it at 1 and let MKL do the hyperthreading for you when it can.” – advice from my contact at the HPC center
  • can I still write out and append to a file at a specific location?
    • apparently yes…?
    • if not, change to dynamic naming: perhaps let the array script handle the output natively, as long as you have all of your identifying info written out, then combine afterward (again, see the sketch after this list)
    • maybe I should change R script output to a print statement that will (hypothetically) go into the auto-generated output files?
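
To make this concrete for future me, a hypothetical sketch of both points. The thread counts follow the quoted advice, and “SGE_TASK_ID” is an assumption about this scheduler’s array-job variable; “results” stands in for whatever this task produced.

# cap MKL threading before INLA spins up its own threads
Sys.setenv(MKL_NUM_THREADS = "2")
library(INLA)
inla.setOption("num.threads", 36)  # let INLA have the cores, per the advice above

# task-specific output so array jobs never append to the same file
task = as.integer(Sys.getenv("SGE_TASK_ID", unset = "1"))
out.file = sprintf("results_task%03d.csv", task)
write.csv(results, out.file, row.names = FALSE)  # "results" = this task's output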

To-do…

  • ask a collaborator to initiate a Globus Connect space
  • write out LOO dynamically
  • do I need to take advantage of the parallel functionality of R-INLA, or should I run single core?

Fastest Fitting for Bayesian Models?

Right now, I’m focused on the computational efficiency (and thereby speed) of fitting Bayesian models. For my dissertation I used R-INLA, so in this “lab notebook” of a blog I need to jot down my thoughts on some current methodological hangups. Even though R-INLA seems to be the fastest thing available, it’s honestly still not quite fast or efficient enough for the full analysis I want to run: it’s taking too long for a re-analysis. So I’m exploring whether it’s still the best thing available.

  • R-INLA
    • pros…
      • fast, as compared to other fitting algorithms
    • cons…
      • parallel processing (considering how much processing is sometimes required) can easily hog shared server resources: a postdoc I worked with and I talked about this a fair amount
      • crashes without clear indication of what happened: it’s a bit of a black box
      • can’t always estimate some parameters of interest
  • Hamiltonian Monte Carlo (Nov. 2016)
    • Stan: faster than BUGS (MCMC)
    • No-U-Turn Sampler
    • faster than INLA?
  • Bayesian variable selection regression (2017)

The Latest Bayes in R

What good fortune that I looked up advances in Bayesian methods the day a new CRAN task view was published! If you’re just starting with Bayes, it even lists packages designed to help you learn. I’ve been using R-INLA, and for a few reasons I became curious about what else is out there. I’m still assuming Laplace approximations are faster than Markov chain Monte Carlo, but there are more packages on the list that might be good to check out if I abandon that method altogether.

  • LaplacesDemon, iterLap: these appealed to me as alternatives to R-INLA
  • arm, RSGHB: might be worth checking out for hierarchical models
  • bayesImageS: image analysis
  • bayesmeta, bspmma: meta-analysis
  • BayesTree, tgp: Bayesian additive regression trees (BART)
  • bayesQR: quantile regression
  • BayesVarSel: variable selection
  • deal, ebdbNet, gRain, NetworkChange (etc.; see this link): network analysis
  • FME: a model-fitting companion to deSolve
  • hbsae: small area estimation
  • spBayes: spatial
  • spikeSlabGAM: geoadditive
  • spTimer: space-time
  • ramps: geostatistical

There are also interesting choices for model averaging (and ensembles), time series, extreme value, clustering, threshold, change point and auto-regressive models.

Multilevel Models

My significant other, in getting his M.S. research ready for publication, introduced me to multilevel modeling, and we worked through a reanalysis together. Now I’m also interested in applying this framework to my 2nd dissertation chapter before publishing, so I can model a cohesive community response while still extracting species’ responses.

The following code block is copied straight from a forum post, but I wanted to save it in case the thread disappears someday. The author converts multilevel models from lme4 format to (Bayesian) R-INLA format.

“For each model I have first the lme formulation and then the formula for INLA” – Ignacio Quintero

library(lme4)  # the lmer() formulations below need this
library(INLA)  # provides inla() and the f() terms

# random intercept
m1 = lmer(dr ~ st.ele + (1|poId), data = sdat)
form = dr ~ st.ele + f(poId, model = 'iid')

# organize data: build index copies of poId for the INLA formulas below
n.loc = max(sdat$poId)
sdat$i.int = sdat$poId
sdat$j.int = sdat$poId + n.loc
sdat$k.int = sdat$j.int + n.loc

# uncorrelated random intercept and random slope
m2 = lmer(dr ~ st.ele + (1|poId) + (0 + st.ele|poId), data = sdat)
form = dr ~ st.ele + f(i.int, model = 'iid') + f(j.int, st.ele, model = 'iid')

# correlated random intercept and random slope
m3 = lmer(dr ~ st.ele + (st.ele |poId), data = sdat)
form = dr ~ st.ele + f(i.int, model = 'iid2d', n = 2*n.loc) + f(j.int, st.ele, copy = 'i.int')

# uncorrelated random intercept and random slope of st.ele^2
m4 = lmer(dr ~ st.ele + I(st.ele^2) + (1|poId) + (0 + I(st.ele^2)|poId), data = sdat)
form = dr ~ st.ele + I(st.ele^2) + f(i.int, model = 'iid') + f(j.int, I(st.ele^2), model = 'iid')
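
To actually fit one of these (my addition, not Quintero’s; it assumes “sdat” as above, a Gaussian response, and whichever “form” you built), the call is just:

r2 = inla(form, data = sdat, family = "gaussian")
summary(r2)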

I’m currently running a model selection per community, to determine which variables are most important within each avian community.

Spatio-temporal Models

A few tips and tricks along the way for spatial models…

  • you must have an observation for every feature
  • those observations must be in the same order as the shapefile
  • the corresponding identifier can only be 1:(number of features in the shapefile); see the sketch after this list
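
A hypothetical sketch of what those three requirements look like in practice (every object and column name here is made up):

# d = observation data; shape = the shapefile's attribute table
d = d[match(shape$NAME, d$area), ]  # rows in the same order as the features
d$ID = seq_len(nrow(d))             # identifier runs 1:n in that order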

I’m not sure the spatial effect alone will be much more than a localized range map (but maybe I just need to think it through more). I built an ecological regression model by adding a spatial term to my previous best-fit model:

model = abundance ~ f(ID, model = "bym", graph = "ch2.adj") + f(Year, model = "rw2") + yearfix + f(ObsN, model = "iid") + firstyear + SPI
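
For context, a sketch of how such a formula gets used (assuming my data frame “d” and a Poisson likelihood for the counts; both are assumptions, not code from the original analysis):

result = inla(model, data = d, family = "poisson",
              control.predictor = list(compute = TRUE))
summary(result)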

I’m also looking at a spatial trend model:

model = abundance ~ f(ID, model = "bym", graph = "ch2.adj") + f(ID.area, Year, model = "iid") + Year + f(ObsN, model = "iid") + firstyear

It is possible to model nonlinear trends as well (though I’m not sure I totally understand the mechanism in the example, because it doesn’t seem like it necessarily has anything to do with the predictor).

I’m trying out models that are interactions of space and my covariates of interest (i.e., weather and land cover), such as…

model = abundance ~ f(ID, model = "bym", graph = "ch2.adj") + f(Year, model = "iid") + yearfix + f(ObsN, model = "iid") + firstyear + f(ID.area, SPI, model = "iid")

Bayesian Inference Ideas from R-INLA

I kind of hesitate to write “general” R-INLA posts because a) it’s not my job, so b) I don’t really have time to give them the treatment they deserve. I don’t want to rewrite the website (yet), though I do want to make it more accessible. Right now, I’m mining the R-INLA site for biologically relevant ideas…

  • dispersal barrier models
  • spatial models of whatever phenomena (e.g. range maps)
  • spatio-temporal models of whatever phenomena (e.g. spatially varying relationships between response and covariate)
  • species distribution models
  • clustering analyses (? look more into this)

Until I have my own graphs and maps for the cover photo, here is where there is currently an R-INLA conference going on. I wish I could be there, for several reasons!

Bayesian Models with R-INLA

Fitting Bayesian models has gotten even better with the R package INLA. The fact that it’s integrated into R makes for simpler scripting, and the approach means much-reduced computational time. There is a lot of literature out there about the “why and how” of the integrated nested Laplace approximation, which is why my treatment of it in this blog will likely be more practical. “The” paper about INLA for Bayesian stats starts out talking about the class of structured additive regression models. Fortunately for non-statisticians like me, they throw me a bone and give a list of examples from which I can pick out a few that I recognize.

  • (generalized) linear models
  • (generalized) additive models
  • smoothing spline models
  • state space models
  • semiparametric regression
  • spatial/temporal models
  • log-Gaussian Cox processes
  • geostatistical/additive models

The focus is on latent Gaussian models (about which there will be an entire conference this fall; the postdoc I work with and I have joked about going). As the name would imply, the latent field is Gaussian, “controlled by a few hyperparameters and with non-Gaussian response variable” (Rue et al. 2009).

Now, there are nice tutorials on the website! There are “volumes” that correspond to the OpenBUGS examples, and the OpenBUGS write-ups are nice for explaining the Bayesian stats behind the examples.

Another cool feature is that you can run INLA remotely from within your R script. And for that general question of why my models crash, or how the fit is going, I found a reference to error checking that might help me diagnose problems.
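
If I understand the documentation correctly, remote execution is just a flag on the call once the remote host is configured per the R-INLA site. A hedged sketch (“form” and “dat” are placeholders again), with the verbose output I mentioned earlier turned on:

result = inla(form, data = dat, family = "poisson",
              inla.call = "remote",  # run the heavy computation on the configured server
              verbose = TRUE)        # print internal output, useful when diagnosing crashes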