R: apply the mean to numeric columns and a majority vote to categorical columns

Time: 2015-08-29 07:52:35

Tags: r data.table dplyr mean

Suppose I have the following table:

Name    Gender  Place Age V1
Tom     M       NY    24  A
Nadia   F       AT    22  A
Alex    M       DE    42  B
Jodie   F       OH    18  B
Tom     M       NY    28  B
Alex    F       ID    32  B
Nadia   F       AT    34  A
Tom     M       OH    18  A

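I want to group the table by Name and Gender, replace Place and V1 with the majority vote of the values in each group, and replace Age with the numeric mean. The result should be: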

Name    Gender  Place Age      V1
Tom     M       NY    23.3333  A
Nadia   F       AT    28       A
Alex    M       DE    42       B
Jodie   F       OH    18       B
Alex    F       ID    32       B

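Tom (M) has three entries, with NY occurring twice and OH once. By majority vote, NY is chosen because it occurs more often. The same holds for A in V1. The mean of the ages (24, 28 and 18) is 23.3333.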
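As a quick check of the reasoning above, here is a minimal base-R sketch for the Tom (M) group only. The three vectors are copied by hand from the table, and the variable names are purely illustrative.

```r
# Rows for Tom (M), copied from the example table
place <- c("NY", "NY", "OH")
v1    <- c("A", "B", "A")
age   <- c(24, 28, 18)

# Majority vote: the most frequent value of each categorical column
place_vote <- names(which.max(table(place)))
v1_vote    <- names(which.max(table(v1)))

# Numeric column: arithmetic mean
age_mean <- mean(age)

print(place_vote)  # "NY"
print(v1_vote)     # "A"
print(age_mean)    # 23.33333
```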

I got the numeric mean using dplyr:

dt <- dt %>%
    group_by_(.dots=lapply(names(dt)[c(1, 2)], as.symbol)) %>%
    summarise_each(funs(mean))

And I can do the majority vote on Place and V1 separately:

# assumes dt is a data.table; := writes the per-group majority value by reference
dt[, Place := names(which.max(table(Place))), by = .(Name, Gender)]
dt[, V1 := names(which.max(table(V1))), by = .(Name, Gender)]

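My problem is performance. I have a very large data set, and these modifications take too long when done in multiple steps. At a minimum, it would be great to do the majority vote in one step with some kind of apply function. Ideally, the majority vote would be added to the dplyr call.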

2 answers:

Answer 0: (score: 5)

We create a vector of grouping column names ('grpCol') and use setdiff to get the remaining column names ('nm1'). We then loop (sapply) through 'nm1' to check which of those columns are numeric (is.numeric), which returns a logical index ('indx').

grpCol <- c('Name', 'Gender')
nm1 <- setdiff(names(df1), grpCol)
indx <- sapply(df1[nm1], is.numeric)
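For the sample data listed under "Data" in this answer, the index can be inspected as follows. This is a small sketch that rebuilds the data frame with data.frame() so it runs on its own; nothing else is assumed.

```r
# Sample data from this answer, rebuilt as a plain data.frame
df1 <- data.frame(
  Name   = c("Tom", "Nadia", "Alex", "Jodie", "Tom", "Alex", "Nadia", "Tom"),
  Gender = c("M", "F", "M", "F", "M", "F", "F", "M"),
  Place  = c("NY", "AT", "DE", "OH", "NY", "ID", "AT", "OH"),
  Age    = c(24L, 22L, 42L, 18L, 28L, 32L, 34L, 18L),
  V1     = c("A", "A", "B", "B", "B", "B", "A", "A"),
  stringsAsFactors = FALSE
)

grpCol <- c("Name", "Gender")
nm1    <- setdiff(names(df1), grpCol)     # "Place" "Age" "V1"
indx   <- sapply(df1[nm1], is.numeric)    # named logical vector

num_cols <- names(indx)[indx]    # "Age"
cat_cols <- names(indx)[!indx]   # "Place" "V1"
print(indx)
```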

We also create a Mode function that returns the element with the highest frequency.

Mode <- function(x) {
 ux <- unique(x)
 ux[which.max(tabulate(match(x, ux)))]
}
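A short usage sketch of the Mode function above, including the tie case. With a tie, which.max picks the first maximum, so this Mode returns whichever tied value appears first in the input, whereas names(which.max(table(x))) returns the first one in sorted order. The input vectors are made up for illustration.

```r
Mode <- function(x) {
  ux <- unique(x)                          # distinct values, in order of first appearance
  ux[which.max(tabulate(match(x, ux)))]    # value with the highest count
}

m1 <- Mode(c("NY", "NY", "OH"))   # "NY"
m2 <- Mode(c(3L, 1L, 3L, 2L))     # 3

# Tie: both values occur once
m3 <- Mode(c("B", "A"))                       # "B" (first to appear)
m4 <- names(which.max(table(c("B", "A"))))    # "A" (first in sorted order)
```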

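We convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouping by 'grpCol', subset the Subset of Data.table (.SD) using 'indx'. We return the mean for the numeric columns and the mode for the non-numeric columns, then concatenate (c) the two lists to get the expected output.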

setDT(df1)[,c(lapply(.SD[, names(indx)[indx], with=FALSE], mean),
      lapply(.SD[, names(indx)[!indx], with=FALSE], Mode)) , 
               by = grpCol]
#   Name Gender      Age Place V1
#1:   Tom      M 23.33333    NY  A
#2: Nadia      F 28.00000    AT  A
#3:  Alex      M 42.00000    DE  B
#4: Jodie      F 18.00000    OH  B
#5:  Alex      F 32.00000    ID  B

Or, as @Frank mentioned in the comments, we can put an if/else condition inside lapply instead of creating 'indx'.

setDT(df1)[, lapply(.SD, function(x) {if(is.numeric(x)) mean(x) 
                else Mode(x)} ),  by=.(Name,Gender)]
#    Name Gender Place      Age V1
#1:   Tom      M    NY 23.33333  A
#2: Nadia      F    AT 28.00000  A
#3:  Alex      M    DE 42.00000  B
#4: Jodie      F    OH 18.00000  B
#5:  Alex      F    ID 32.00000  B

Data

df1 <- structure(list(Name = c("Tom", "Nadia", "Alex", "Jodie", "Tom", 
"Alex", "Nadia", "Tom"), Gender = c("M", "F", "M", "F", "M", 
"F", "F", "M"), Place = c("NY", "AT", "DE", "OH", "NY", "ID", 
"AT", "OH"), Age = c(24L, 22L, 42L, 18L, 28L, 32L, 34L, 18L), 
V1 = c("A", "A", "B", "B", "B", "B", "A", "A")), .Names = c("Name", 
"Gender", "Place", "Age", "V1"), class = "data.frame",
row.names = c(NA, -8L))

Answer 1: (score: 1)

Here is a dplyr way:

library(dplyr)

df1 %>% 
 group_by(Name, Gender) %>% 
 mutate(Age = mean(Age)) %>% 
 filter(Place == names(which.max(table(Place))) & 
           V1 == names(which.max(table(V1)))) %>% unique

#      Name Gender Place      Age V1
#1   Tom      M    NY 23.33333  A
#2 Nadia      F    AT 28.00000  A
#3  Alex      M    DE 42.00000  B
#4 Jodie      F    OH 18.00000  B
#5  Alex      F    ID 32.00000  B
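One caveat with the filter approach: a group is dropped entirely if its majority Place and its majority V1 never occur together in the same row. A possible alternative, sketched below, summarises each column independently. It assumes dplyr 1.0 or later (for across()) and reuses the Mode helper from the other answer; it is one way to do this, not necessarily the fastest.

```r
library(dplyr)

# Sample data from the thread, rebuilt as a plain data.frame
df1 <- data.frame(
  Name   = c("Tom", "Nadia", "Alex", "Jodie", "Tom", "Alex", "Nadia", "Tom"),
  Gender = c("M", "F", "M", "F", "M", "F", "F", "M"),
  Place  = c("NY", "AT", "DE", "OH", "NY", "ID", "AT", "OH"),
  Age    = c(24L, 22L, 42L, 18L, 28L, 32L, 34L, 18L),
  V1     = c("A", "A", "B", "B", "B", "B", "A", "A"),
  stringsAsFactors = FALSE
)

# Most frequent value; ties go to the value that appears first
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# One pass per group: mean for numeric columns, Mode for everything else
res <- df1 %>%
  group_by(Name, Gender) %>%
  summarise(across(everything(),
                   ~ if (is.numeric(.x)) mean(.x) else Mode(.x)),
            .groups = "drop")

print(res)
```

The result has one row per (Name, Gender) group; summarise returns the groups in sorted key order rather than in order of first appearance.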