Sum and replace columns with the same name in a data frame containing different classes (R)

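Time: 2015-01-08 11:49:25

Tags: r

I have a data frame whose columns have several different classes. I want to sum the columns that share a name and are numeric, and replace the old columns with the new sum. Does anyone know a way to do this?

That is, I have a data frame like: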

col1 col2  col3 col3 
char factor int int

and I want to produce

col1  col2  col3 
char factor 2int

I have previously used:

data <- as.data.frame(do.call(cbind, by(t(data),INDICES=names(data),FUN=colSums)))

However, that was on a data frame containing only numeric variables.
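A minimal sketch of why that one-liner does not carry over to mixed classes, assuming a hypothetical toy data frame `toy` (not from the question). `t()` converts the data frame to a matrix first, and a matrix holds a single type, so one character or factor column turns every value into character and `colSums()` rejects it.

```r
# Hypothetical toy data: one character column plus two numeric columns sharing a name
toy <- data.frame(id = c("a", "b"), x = 1:2, x = 3:4,
                  check.names = FALSE, stringsAsFactors = FALSE)

m <- t(toy)            # as.matrix() runs first, so every cell becomes character
m_is_char <- is.character(m)

# colSums() needs numeric input, so the by()/colSums() approach fails here;
# tryCatch() just records that an error was raised
res <- tryCatch(colSums(m), error = function(e) "error")
```

This is why the answer works on the numeric columns only and leaves the remaining columns untouched.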

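There are other examples online, but none of them meets all of these conditions: replacing the columns, keeping the rest of the data frame, and working on a data frame with several classes.

Similar question: how do I search for columns with same name, add the column values and replace these columns with same name by their sum? Using R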

1 answer:

Answer 0: (score: 1)

Try

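dat1 <- dat  # keep a copy of the original dataset
indx <- sapply(dat, is.numeric)  # check which columns are numeric
nm1 <- which(indx)  # positions of the numeric columns, named by column name
indx2 <- duplicated(names(nm1))  # flag repeated names among the numeric columns
# split "nm1" by its names, take the `rowSums` of each group with `Map`,
# and store each sum in the first column carrying that name;
# indexing by unique(names(nm1)) keeps the groups in first-occurrence order,
# because split() alone returns them sorted by name
dat[nm1[!indx2]] <- Map(function(x, y) rowSums(x[y]), list(dat),
                        split(nm1, names(nm1))[unique(names(nm1))])

dat[-nm1[indx2]]  # drop the repeated columns that are now redundant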

Update

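Or, for efficiency, work only on the columns that are both duplicated and numeric, and leave the others intact. Create an index ("indx2") of the duplicated columns, subset "nm1" with "indx2", then do the rowSums as above. Finally, use "indx3" to remove the unwanted (duplicated) columns.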
dat <- dat1  # start again from the original data; nm1 is reused from above
indx2 <- duplicated(names(nm1)) | duplicated(names(nm1), fromLast = TRUE)
nm2 <- nm1[indx2]  # numeric columns whose name occurs more than once
indx3 <- duplicated(names(nm2))  # every occurrence of a name except the first
dat[nm2[!indx3]] <- Map(function(x, y) rowSums(x[y]), list(dat),
                        split(nm2, names(nm2))[unique(names(nm2))])
datN <- dat[-nm2[indx3]]
datN
#    col1 col2 col3 col4 col5
#1    16   23    2   10   10
#2    10   18   12    8   18
#3    21   23   15    6   10
#4    14   37    3    5   15
#5    29   39    5    1   11
#6    26   31   14    2   20
#7    25   31    2    8   10
#8    36   31   12    8    6
#9    32   26   13    6    4
#10   16   38    1    7    3
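The same idea can be wrapped in a small helper. This is only a sketch under stated assumptions: the function name `sum_dup_numeric` and the toy data are hypothetical and not part of the original answer. It assigns each sum group by group, so it does not depend on the order in which `split()` returns the groups, and it drops columns with `setdiff()` so that a data frame with no repeated numeric names comes back unchanged.

```r
# Hypothetical helper: sum numeric columns that share a name, keep everything else
sum_dup_numeric <- function(df) {
  num_idx <- which(vapply(df, is.numeric, logical(1)))  # positions of numeric columns
  groups <- split(unname(num_idx), names(df)[num_idx])  # positions grouped by name
  drop <- integer(0)
  for (g in groups) {
    if (length(g) > 1L) {
      df[[g[1L]]] <- rowSums(df[g])  # store the sum in the first column of the group
      drop <- c(drop, g[-1L])        # remember the other columns for removal
    }
  }
  df[setdiff(seq_along(df), drop)]
}

# Hypothetical toy data with repeated numeric names and one character column
toy <- data.frame(z = 1:3, id = c("p", "q", "r"), a = 4:6, z = 7:9,
                  a = c(10, 20, 30), check.names = FALSE,
                  stringsAsFactors = FALSE)
out <- sum_dup_numeric(toy)
```

Note that `rowSums()` returns doubles, so a summed column is numeric rather than integer even when every input column was integer.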

Checking the results

rowSums(dat1[names(dat1) %in% 'col1'])
#[1] 16 10 21 14 29 26 25 36 32 16
rowSums(dat1[names(dat1) %in% 'col2'])
#[1] 23 18 23 37 39 31 31 31 26 38
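The spot checks above could be extended to every numeric column name at once. A rough sketch, assuming a hypothetical toy data frame `toy` in place of the real data: it computes reference sums one name at a time, independently of `Map()` and `split()`, so they can be compared against the result.

```r
# Hypothetical toy data: two numeric "col1" columns and one character column
toy <- data.frame(col1 = 1:3, col2 = c("a", "b", "c"), col1 = 4:6,
                  check.names = FALSE, stringsAsFactors = FALSE)

num <- vapply(toy, is.numeric, logical(1))  # which columns are numeric
# One column of reference sums per distinct numeric column name
expected <- sapply(unique(names(toy)[num]),
                   function(nm) rowSums(toy[num & names(toy) == nm]))
```

Each column of `expected` can then be compared with the column of the same name in the summed data frame, for example with `all.equal()`.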

Data

dat <- structure(list(col1 = c(6L, 5L, 15L, 11L, 14L, 19L, 6L, 16L, 
17L, 6L), col2 = c(13L, 8L, 14L, 14L, 7L, 19L, 4L, 1L, 11L, 3L
), col3 = structure(c(2L, 5L, 8L, 3L, 4L, 7L, 2L, 5L, 6L, 1L), .Label = c("1", 
"2", "3", "5", "12", "13", "14", "15"), class = "factor"), col2 = c(7L, 
5L, 8L, 3L, 19L, 5L, 15L, 13L, 14L, 20L), col4 = structure(c(7L, 
6L, 4L, 3L, 1L, 2L, 6L, 6L, 4L, 5L), .Label = c("1", "2", "5", 
"6", "7", "8", "10"), class = "factor"), col5 = c(10L, 18L, 10L, 
15L, 11L, 20L, 10L, 6L, 4L, 3L), col1 = c(10L, 5L, 6L, 3L, 15L, 
7L, 19L, 20L, 15L, 10L), col2 = c(3L, 5L, 1L, 20L, 13L, 7L, 12L, 
17L, 1L, 15L)), .Names = c("col1", "col2", "col3", "col2", "col4", 
"col5", "col1", "col2"), row.names = c(NA, -10L), class = "data.frame")