r - Calculating multiple average and creating columns -
i have following table. groups , ages, create following columns. create columns need calculate mean each group , calculate grand mean. calculate grandmean- mean.
group1 group2 group3 age1 age2 age3 age4 diff1ag1 diff2ag1 diff3ag1 diff1ag2 diff2ag2 diff3ag2 diff1ag3 diff2ag3 diff3ag3 w q n 12 12 12 12 0.335 -0.08 -0.08 0.085 0.5 0.5 0 0.42 0.42 m q n 11 12 13 12 -0.335 -0.08 -0.08 -0.085 0.5 0.5 0 0.42 0.42 m p m 12 11 11 11 -0.335 0.08 0.08 -0.085 -0.5 -0.5 0 -0.42 -0.42 w p m 12 11 12 14 0.335 0.08 0.08 0.085 -0.5 -0.5 0 -0.42 -0.42 m p m 11 11 12 12 -0.335 0.08 0.08 -0.085 -0.5 -0.5 0 -0.42 -0.42
edit
the initial dataset i.e. ('df1') created using first 7 columns shown in op's post (i.e. 'age', , 'group' columns). so, if op's original dataset 'dat', df1 <- dat[1:7]
we using data.table
. convert 'data.frame' ('df1' - without 'diff' columns) 'data.table (setdt
) , create index column ('ind' - useful later sorting)
library(data.table) #data.table_1.9.5 setdt(df1)[,ind:=1:.n]
change dataset 'wide' 'long' format using `melt' have single 'age' column , 'ageg' group column ('dm'). create name vector 'group' columns ('nm1') , vector ('nm2') naming output columns.
dm <- melt(df1, id.vars=c(1:3,8), variable.name='ageg', value.name='age') nm1 <- grep('group', names(dm), value=true) nm2 <- sub('group', 'diff', nm1)
we loop through 'nm1' mean
of 'age' column ('tmp') grouped 'ageg' , 'group' column (nm1[j]
) in loop. create 'tmp2' mean
of 'tmp' grouped 'ageg' column. set 'key' columns in melted dataset 'ageg' , 'group' column (setkeyv(dm, c(nm1[j], 'ageg'))
) , dataset 'tmp' , 'tmp2' created columns ('dn2'). join datasets (setkeyv(dn, c(nm1[j], 'ageg'))[dm]
) , create 'diff' column based on difference between 'tmp' , 'tmp2' columns. remove columns not needed ([, 3:4 := null]
)
for(j in seq_along(nm1)){ dn <- dm[, list(tmp=mean(age)), c(nm1[j], 'ageg')][, tmp2:= mean(tmp), ageg] setkeyv(dm, c(nm1[j], 'ageg')) dm <- setkeyv(dn, c(nm1[j], 'ageg'))[dm][, nm2[j] := round(tmp-tmp2,3)][, 3:4 := null] }
now 'melted' dataset ('dm') additional 'diff' columns can converted 'long' 'wide' format using reshape
, change order
of rows 'ind' column (changed column order , names of 'res1` match expected output),
res1 <- reshape(dm, idvar=c('group1', 'group2', 'group3', 'ind'), timevar='ageg', direction='wide')[order(ind)][, c(3:1, 5, 9, 13, 17, 6:8, 10:12, 14:16, 18:20), with=false] ind1 <- grep('^age', names(res1)) setnames(res1, ind1, sub('.*\\.', '', names(res1)[ind1])) setnames(res1, sub('[.]', '', names(res1))) res1 # group1 group2 group3 age1 age2 age3 age4 diff1age1 diff2age1 diff3age1 #1: w q n 12 12 12 12 0.333 -0.083 -0.083 #2: m q n 11 12 13 12 -0.333 -0.083 -0.083 #3: m p m 12 11 11 11 -0.333 0.083 0.083 #4: w p m 12 11 12 14 0.333 0.083 0.083 #5: m p m 11 11 12 12 -0.333 0.083 0.083 # diff1age2 diff2age2 diff3age2 diff1age3 diff2age3 diff3age3 diff1age4 #1: 0.083 0.5 0.5 0 0.417 0.417 0.667 #2: -0.083 0.5 0.5 0 0.417 0.417 -0.667 #3: -0.083 -0.5 -0.5 0 -0.417 -0.417 -0.667 #4: 0.083 -0.5 -0.5 0 -0.417 -0.417 0.667 #5: -0.083 -0.5 -0.5 0 -0.417 -0.417 -0.667 # diff2age4 diff3age4 #1: -0.167 -0.167 #2: -0.167 -0.167 #3: 0.167 0.167 #4: 0.167 0.167 #5: 0.167 0.167
note: included 'diffage4' columns well. not in expected result.
update
this done using base r
methods. logic similar above. create 'ind' column in 'df1'. convert 'wide' 'long' using stack
base r
. name index vectors ('nm1', 'nm2') created before.
df1$ind <- 1:nrow(df1) dm <- cbind(df1[c(1:3,8)],setnames(stack(df1[4:7]), c('age', 'ageg'))) nm1 <- grep('group', names(dm), value=true) nm2 <- sub('group', 'diff', nm1)
here, change in using tapply
, ave
group mean
, difference between mean
per group , overall mean
each 'ageg' create 'diff' columns
for(j in seq_along(nm1)){ tmp <- with(dm, tapply(age, list(get(nm1[j]), ageg), fun=mean)) dm[nm2[j]] <- with(dm, ave(age, get(nm1[j]), ageg, fun=mean)) - rep(colmeans(tmp),each=nrow(df1)) }
now, reshape
(base r function) long
wide
described in data.table
solution
res2 <- reshape(dm, idvar=c('group1', 'group2', 'group3', 'ind'), timevar='ageg', direction='wide') res2 <- res2[,c(1:3,5, 9, 13, 17, 6:8, 10:12, 14:16, 18:20)] ind1 <- grep('^age', names(res2)) names(res2)[ind1] <- sub('.*\\.', '', names(res2)[ind1]) res2 # group1 group2 group3 age1 age2 age3 age4 diff1.age1 diff2.age1 diff3.age1 #1 w q n 12 12 12 12 0.3333333 -0.08333333 -0.08333333 #2 m q n 11 12 13 12 -0.3333333 -0.08333333 -0.08333333 #3 m p m 12 11 11 11 -0.3333333 0.08333333 0.08333333 #4 w p m 12 11 12 14 0.3333333 0.08333333 0.08333333 #5 m p m 11 11 12 12 -0.3333333 0.08333333 0.08333333 # diff1.age2 diff2.age2 diff3.age2 diff1.age3 diff2.age3 diff3.age3 diff1.age4 #1 0.08333333 0.5 0.5 0 0.4166667 0.4166667 0.6666667 #2 -0.08333333 0.5 0.5 0 0.4166667 0.4166667 -0.6666667 #3 -0.08333333 -0.5 -0.5 0 -0.4166667 -0.4166667 -0.6666667 #4 0.08333333 -0.5 -0.5 0 -0.4166667 -0.4166667 0.6666667 #5 -0.08333333 -0.5 -0.5 0 -0.4166667 -0.4166667 -0.6666667 # diff2.age4 diff3.age4 #1 -0.1666667 -0.1666667 #2 -0.1666667 -0.1666667 #3 0.1666667 0.1666667 #4 0.1666667 0.1666667 #5 0.1666667 0.1666667
update2
another dataset ('df2') more number of 'age' columns tested above base r
solution. op provided dataset .csv
file
df2 <- read.csv('book1.csv', stringsasfactors=false) #dput output below str(df2, list.len=3) #'data.frame': 5 obs. of 24 variables: # $ group1: chr "w" "m" "m" "w" ... #$ group2: chr "q" "q" "p" "p" ... #$ group3: chr "n" "n" "m" "m" ... #[list output truncated]
made adjustments column index
df2$ind <- 1:nrow(df2)
here, there 3 'group' columns ('group1', 'group2', 'group3') , 20 'age' columns ('age1', 'age2',....'age20'). so, can paste
subset dataset column names.
dm <- cbind(df2[c(paste0('group',1:3), 'ind')], setnames(stack(df2[paste0('age',1:20)]), c('age', 'ageg'))) dm$ageg <- with(dm, factor(ageg, levels=unique(ageg))) nm1 <- grep('group', names(dm), value=true) nm2 <- sub('group', 'diff', nm1) for(j in seq_along(nm1)){ tmp <- with(dm, tapply(age, list(get(nm1[j]), ageg), fun=mean)) dm[nm2[j]] <- with(dm, ave(age, get(nm1[j]), ageg, fun=mean)) - rep(colmeans(tmp),each=nrow(df2)) } res2 <- reshape(dm, idvar=c('group1', 'group2', 'group3', 'ind'), timevar='ageg', direction='wide')
the below steps needed if order of columns in output important.
ordind <- c(1:3, seq(5,ncol(res2), by=4), c(t(sapply(6:8, function(x) seq(x,ncol(res2), by=4))))) res3 <- res2[ordind] ind1 <- grep('^age', names(res3)) names(res3)[ind1] <- sub('.*\\.', '', names(res3)[ind1]) ncol(res3) #[1] 83
the first 7 columns of 'df2' same 'df1' dataset. so, results based on columns should match res2
.
res3[c(1:7, 24:35)] # group1 group2 group3 age1 age2 age3 age4 diff1.age1 diff2.age1 diff3.age1 #1 w q n 12 12 12 12 0.3333333 -0.08333333 -0.08333333 #2 m q n 11 12 13 12 -0.3333333 -0.08333333 -0.08333333 #3 m p m 12 11 11 11 -0.3333333 0.08333333 0.08333333 #4 w p m 12 11 12 14 0.3333333 0.08333333 0.08333333 #5 m p m 11 11 12 12 -0.3333333 0.08333333 0.08333333 # diff1.age2 diff2.age2 diff3.age2 diff1.age3 diff2.age3 diff3.age3 diff1.age4 #1 0.08333333 0.5 0.5 0 0.4166667 0.4166667 0.6666667 #2 -0.08333333 0.5 0.5 0 0.4166667 0.4166667 -0.6666667 #3 -0.08333333 -0.5 -0.5 0 -0.4166667 -0.4166667 -0.6666667 #4 0.08333333 -0.5 -0.5 0 -0.4166667 -0.4166667 0.6666667 #5 -0.08333333 -0.5 -0.5 0 -0.4166667 -0.4166667 -0.6666667 # diff2.age4 diff3.age4 #1 -0.1666667 -0.1666667 #2 -0.1666667 -0.1666667 #3 0.1666667 0.1666667 #4 0.1666667 0.1666667 #5 0.1666667 0.1666667
it same 'res2' based on 'df1'
data
df1 <- structure(list(group1 = c("w", "m", "m", "w", "m"), group2 = c("q", "q", "p", "p", "p"), group3 = c("n", "n", "m", "m", "m"), age1 = c(12l, 11l, 12l, 12l, 11l), age2 = c(12l, 12l, 11l, 11l, 11l), age3 = c(12l, 13l, 11l, 12l, 12l), age4 = c(12l, 12l, 11l, 14l, 12l)), .names = c("group1", "group2", "group3", "age1", "age2", "age3", "age4"), class = "data.frame", row.names = c(na, -5l)) df2 <- structure(list(group1 = c("w", "m", "m", "w", "m"), group2 = c("q","q", "p", "p", "p"), group3 = c("n", "n", "m", "m", "m"), age1 = c(12l, 11l, 12l, 12l, 11l), age2 = c(12l, 12l, 11l, 11l, 11l), age3 = c(12l, 13l, 11l, 12l, 12l), age4 = c(12l, 12l, 11l, 14l, 12l), age5 = c(11l, 10l, 9l, 17l, 12l), age6 = c(13l, 11l, 12l, 12l, 1l), age7 = c(1l, 1l, 1l, 1l, 1l), age8 = c(0l, 0l, 0l, 0l, 0l), age9 = c(12l, 13l, 12l, 3l, 1l), age10 = c(12l, 13l, 12l, 3l, 1l), age11 = c(12l, 13l, 12l, 3l, 1l), age12 = c(12l, 13l, 12l, 3l, 1l), age13 = 1:5, age14 = c(12l, 11l, 12l, 12l, 12l), age15 = c(12l, 12l, 12l, 12l, 1l), age16 = c(15l, 13l, 12l, 12l, 2l), age17 = c(3l, 4l, 5l, 6l, 6l), age18 = c(0l, 12l, 12l, 12l, 9l), age19 = c(12l, 12l, 12l, 12l, 12l), age20 = c(0l, 12l, 12l, 12l, 9l), ind = 1:5), .names = c("group1", "group2", "group3", "age1", "age2", "age3", "age4", "age5", "age6", "age7", "age8", "age9", "age10", "age11", "age12", "age13", "age14", "age15", "age16", "age17", "age18", "age19", "age20", "ind"), row.names = c(na, -5l), class = "data.frame")
Comments
Post a Comment