R - Formula interface for glmnet
In the last few months I've worked on a number of projects where I've used the glmnet package to fit elastic net models. It's great, but the interface is rather bare-bones compared to most R modelling functions. In particular, rather than specifying a formula and data frame, you have to give a response vector and predictor matrix. You also lose out on many quality-of-life things that the regular interface provides, e.g. sensible (?) treatment of factors, missing values, putting variables into the correct order, etc.
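For reference, the manual route looks roughly like the sketch below. It uses only base R and the built-in mtcars data, both chosen here purely for illustration; the final glmnet call is left as a comment and assumes the glmnet package is installed.

```r
# Build the response vector and predictor matrix that glmnet expects,
# starting from a formula and a data frame (base R only).
form <- mpg ~ cyl + disp + factor(gear)

mf <- model.frame(form, data = mtcars)          # evaluates the variables in the formula
y  <- model.response(mf)                        # response vector
x  <- model.matrix(attr(mf, "terms"), mf)[, -1] # predictor matrix, intercept column dropped

dim(x)
colnames(x)

# With glmnet installed, the fit itself would then be:
# fit <- glmnet::glmnet(x, y)
```

Every project that uses glmnet this way ends up repeating these steps, and repeating them again at prediction time so that the new matrix has the same columns in the same order.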
So I've generally ended up writing my own code to recreate the formula/data frame interface. Due to client confidentiality issues, I've also ended up leaving this code behind and having to write it again for the next project. I figured I might as well bite the bullet and create an actual package to do this. However, a couple of questions before I do so:
- Are there any issues that complicate using the formula/data frame interface with elastic net models? (I'm aware of standardisation and dummy variables, and wide datasets maybe requiring sparse model matrices.)
- Is there any existing package that does this?
Well, it looks like there's no pre-built formula interface, so I went ahead and made my own. You can download it from GitHub: https://github.com/hong-revo/glmnetutils
Or in R, using devtools::install_github:

    install.packages("devtools")
    library(devtools)
    install_github("hong-revo/glmnetutils")
    library(glmnetUtils)
From the README:

Some quality-of-life functions to streamline the process of fitting elastic net models with glmnet, specifically:

- glmnet.formula provides a formula/data frame interface to glmnet.
- cv.glmnet.formula does a similar thing for cv.glmnet.
- Methods for predict and coef for both of the above.
- A function cvAlpha.glmnet to choose both the alpha and lambda parameters via cross-validation, following the approach described in the help page for cv.glmnet. Optionally does the cross-validation in parallel.
- Methods for plot, predict and coef for the above.
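A minimal usage sketch of the formula interface follows. The choice of mtcars, the formula, and alpha = 0.5 are assumptions made for illustration, and the code is guarded so that it only runs if glmnetUtils is installed.

```r
# Fit elastic net models through the formula interface (runs only if the package is available).
if (requireNamespace("glmnetUtils", quietly = TRUE)) {
  library(glmnetUtils)

  # Formula/data frame call in place of glmnet(x, y)
  fit <- glmnet(mpg ~ cyl + disp + hp + wt, data = mtcars, alpha = 0.5)

  # Cross-validated choice of lambda, same interface
  cvfit <- cv.glmnet(mpg ~ cyl + disp + hp + wt, data = mtcars, alpha = 0.5)

  # coef and predict work from the data frame; no model matrix is built by hand
  print(coef(cvfit, s = "lambda.1se"))
  preds <- predict(cvfit, newdata = head(mtcars), s = "lambda.1se")
  print(preds)
}
```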
Incidentally, while writing the above, I think I realised why nobody has done this before. Central to R's handling of model frames and model matrices is a terms object, which includes a matrix with one row per variable and one column per main effect and interaction. In effect, that's (at minimum) a p x p matrix, where p is the number of variables in the model. When p is 16000, which is common these days with wide data, the resulting matrix is about a gigabyte in size.
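To make the size estimate concrete, here is a small base R sketch. The toy formula is made up for illustration; the arithmetic assumes the matrix is stored as 4-byte integers, which is how R stores the "factors" attribute of a terms object.

```r
# The "factors" attribute of a terms object: one row per variable
# (response included), one column per main effect and interaction.
tt  <- terms(y ~ a + b + a:b)
fac <- attr(tt, "factors")
fac
dim(fac)
typeof(fac)   # stored as integers

# Back-of-envelope size for p main effects and no interactions:
# roughly p x p integers at 4 bytes each.
p   <- 16000
gib <- p^2 * 4 / 2^30
gib           # a little under 1 GiB
```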
Still, I haven't had any problems (yet) working with these objects. If it becomes a major issue, I'll see if I can find a workaround.
Update Oct-2016

I've pushed an update to the repo, to address the above issue as well as one related to factors. From the documentation:
There are two ways in which glmnetUtils can generate a model matrix out of a formula and data frame. The first is to use the standard R machinery comprising model.frame and model.matrix; and the second is to build the matrix one variable at a time. These options are discussed and contrasted below.

Using model.frame
This is the simpler option, and the one that is most compatible with other R modelling functions. The model.frame function takes a formula and data frame and returns a model frame: a data frame with special information attached that lets R make sense of the terms in the formula. For example, if a formula includes an interaction term, the model frame will specify which columns in the data relate to the interaction, and how they should be treated. Similarly, if the formula includes expressions like exp(x) or I(x^2) on the RHS, model.frame will evaluate these expressions and include them in the output.

The major disadvantage of using model.frame is that it generates a terms object, which encodes how variables and interactions are organised. One of the attributes of this object is a matrix with one row per variable, and one column per main effect and interaction. At minimum, this is (approximately) a p x p square matrix where p is the number of main effects in the model. For wide datasets with p > 10000, this matrix can approach or exceed a gigabyte in size. Even if there is enough memory to store such an object, generating the model matrix can take a significant amount of time.

Another issue with the standard R approach is the treatment of factors. Normally, model.matrix will turn an N-level factor into an indicator matrix with N-1 columns, with one column being dropped. This is necessary for unregularised models as fit with lm and glm, since the full set of N columns is linearly dependent. With the usual treatment contrasts, the interpretation is that the dropped column represents a baseline level, while the coefficients for the other columns represent the difference in the response relative to the baseline.

This may not be appropriate for a regularised model as fit with glmnet. The regularisation procedure shrinks the coefficients towards zero, which forces the estimated differences from the baseline to be smaller. But this only makes sense if the baseline level was chosen beforehand, or is otherwise meaningful as a default; otherwise it is effectively making the levels more similar to an arbitrarily chosen level.
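The difference between the two factor encodings can be seen with base R alone. This is a small sketch on a made-up three-level factor; it illustrates the encodings and makes no claim about how glmnetUtils produces them internally.

```r
d <- data.frame(f = factor(c("a", "b", "c", "a")))

# Default treatment contrasts: intercept plus N-1 indicator columns.
# Level "a" has no column of its own and acts as the baseline.
mm_treat <- model.matrix(~ f, data = d)
colnames(mm_treat)

# Switching contrasts off for f keeps one indicator column per level,
# so no level is singled out as the baseline.
mm_full <- model.matrix(
  ~ f, data = d,
  contrasts.arg = list(f = contrasts(d$f, contrasts = FALSE))
)
colnames(mm_full)
```

With the full set of columns the design is rank-deficient for lm or glm, but a penalised fit can still estimate a coefficient for every level.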
Manually building the model matrix
To deal with the problems above, glmnetUtils by default will avoid using model.frame, instead building up the model matrix term-by-term. This avoids the memory cost of creating a terms object, and can be noticeably faster than the standard approach. It will also include one column in the model matrix for all levels in a factor; that is, no baseline level is assumed. In this situation, the coefficients represent differences from the overall mean response, and shrinking them to zero is meaningful (usually).

The main downside of not using model.frame is that the formula can only be relatively simple. At the moment, only straightforward formulas like y ~ x1 + x2 + ... + x_p are handled by the code, where the x's are columns already present in the data. Interaction terms and computed expressions are not supported. Where possible, you should compute such expressions beforehand.
Update Apr-2017

After a few hiccups, this is now on CRAN.