Formula interface for glmnet


In the last few months I've worked on a number of projects where I've used the glmnet package to fit elastic net models. It's great, but the interface is rather bare-bones compared to other R modelling functions. In particular, rather than specifying a formula and a data frame, you have to give it a response vector and a predictor matrix. You also lose out on many quality-of-life things that the regular interface provides, e.g. sensible (?) treatment of factors and missing values, putting the variables into the correct order, etc.

So I've ended up writing my own code to recreate the formula/data frame interface. Due to client confidentiality issues, I've ended up leaving that code behind and having to write it again for the next project. I figured I might as well bite the bullet and create an actual package for this. However, I had a couple of questions before doing so:

  • Are there any issues that complicate using a formula/data frame interface with elastic net models? (I'm aware of standardisation and dummy variables, and that wide datasets may require sparse model matrices.)
  • Is there an existing package that does this?

Well, it looks like there's no pre-built formula interface, so I went ahead and made my own. You can download it from GitHub: https://github.com/hong-revo/glmnetutils

Or in R, using devtools::install_github:

install.packages("devtools")
library(devtools)
install_github("hong-revo/glmnetUtils")
library(glmnetUtils)

From the README:

Some quality-of-life functions to streamline the process of fitting elastic net models with glmnet, specifically:

  • glmnet.formula provides a formula/data frame interface to glmnet.
  • cv.glmnet.formula does a similar thing for cv.glmnet.
  • Methods for predict and coef for both of the above.
  • A function cvalpha.glmnet to choose both the alpha and lambda parameters via cross-validation, following the approach described in the help page for cv.glmnet. Optionally does the cross-validation in parallel.
  • Methods for plot, predict and coef for the above.

Incidentally, while writing the above, I think I realised why nobody has done this before. Central to R's handling of model frames and model matrices is the terms object, which includes a matrix with one row per variable and one column per main effect and interaction. In effect, that's (at minimum) a p x p matrix, where p is the number of variables in the model. When p is 16000, which is not uncommon these days with wide data, the resulting matrix is about a gigabyte in size.

Still, I haven't had any problems (yet) working with these objects. If it becomes a major issue, I'll see if I can find a workaround.


Update Oct-2016

I've pushed an update to the repo, to address the above issue along with one related to factors. From the documentation:

There are two ways in which glmnetUtils can generate a model matrix out of a formula and data frame. The first is to use the standard R machinery comprising model.frame and model.matrix; the second is to build the matrix one variable at a time. These options are discussed and contrasted below.

Using model.frame

This is the simpler option, and the one that is most compatible with other R modelling functions. The model.frame function takes a formula and a data frame and returns a model frame: a data frame with special information attached that lets R make sense of the terms in the formula. For example, if the formula includes an interaction term, the model frame will specify which columns in the data relate to the interaction, and how they should be treated. Similarly, if the formula includes expressions like exp(x) or I(x^2) on the RHS, model.frame will evaluate these expressions and include them in the output.
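As a minimal sketch of this machinery (the data frame and variable names below are made up for illustration):

```r
# A small data frame with invented variables
df <- data.frame(y = c(1, 2, 3, 4), x = c(0.5, 1, 1.5, 2))

# model.frame evaluates expressions like I(x^2) and stores them as columns
mf <- model.frame(y ~ x + I(x^2), data = df)
names(mf)       # "y" "x" "I(x^2)"

# model.matrix then turns the model frame into a numeric matrix,
# using the terms information attached to the model frame
mm <- model.matrix(attr(mf, "terms"), mf)
colnames(mm)    # "(Intercept)" "x" "I(x^2)"
```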

The major disadvantage of using model.frame is that it generates a terms object, which encodes how the variables and interactions are organised. One of the attributes of this object is a matrix with one row per variable, and one column per main effect and interaction. At minimum, this is an (approximately) p x p square matrix, where p is the number of main effects in the model. For wide datasets with p > 10000, this matrix can approach or exceed a gigabyte in size. Even if there is enough memory to store such an object, generating the model matrix can take a significant amount of time.
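The attribute in question can be inspected directly. A sketch with three made-up predictors, showing that the "factors" matrix on a terms object has roughly one row and one column per variable (so its size grows quadratically in p):

```r
# Toy formula with invented variable names x1, x2, x3
tt <- terms(y ~ x1 + x2 + x3)

# One row per variable (including the response), one column per term
fac <- attr(tt, "factors")
dim(fac)   # 4 rows (y, x1, x2, x3) by 3 columns (x1, x2, x3)
```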

Another issue with the standard R approach is its treatment of factors. Normally, model.matrix will turn an n-level factor into an indicator matrix with n-1 columns, with one column being dropped. This is necessary for unregularised models as fit with lm and glm, since the full set of n columns is linearly dependent. With the usual treatment contrasts, the interpretation is that the dropped column represents a baseline level, while the coefficients for the other columns represent the difference in the response relative to the baseline.
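A sketch of this default treatment-contrast encoding, using an invented three-level factor:

```r
# A 3-level factor with made-up levels
d <- data.frame(f = factor(c("a", "b", "c", "a")))

# model.matrix drops one level: 3 levels become 2 indicator columns
mm <- model.matrix(~ f, data = d)
colnames(mm)   # "(Intercept)" "fb" "fc" -- level "a" is the dropped baseline
```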

This may not be appropriate for a regularised model as fit with glmnet. The regularisation procedure shrinks the coefficients towards zero, which forces the estimated differences from the baseline to be smaller. This makes sense if the baseline level was chosen beforehand, or is otherwise a meaningful default; otherwise, it's making the levels more similar to an arbitrarily chosen level.

Manually building the model matrix

To deal with the problems above, glmnetUtils by default will avoid using model.frame, instead building up the model matrix term-by-term. This avoids the memory cost of creating a terms object, and can be noticeably faster than the standard approach. It will also include one column in the model matrix for all levels in a factor; that is, no baseline level is assumed. In this situation, the coefficients represent differences from the overall mean response, and shrinking them to zero is meaningful (usually).
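One way to get this full-indicator encoding in base R (not necessarily how glmnetUtils implements it internally) is to drop the intercept, so no level is used as a baseline. The factor levels below are made up for illustration:

```r
# Same invented 3-level factor as before
d <- data.frame(f = factor(c("a", "b", "c", "a")))

# Dropping the intercept keeps an indicator column for every level
mm <- model.matrix(~ f - 1, data = d)
colnames(mm)   # "fa" "fb" "fc" -- one column per level, no baseline
```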

The main downside of not using model.frame is that the formulas handled can only be relatively simple. At the moment, only straightforward formulas like y ~ x1 + x2 + ... + x_p are handled by the code, where the x's are columns already present in the data. Interaction terms and computed expressions are not supported. Where possible, you should compute such expressions beforehand.


Update Apr-2017

After a few hiccups, it's now on CRAN.

