Suppose I have the following table:
Name Gender Place Age V1
Tom M NY 24 A
Nadia F AT 22 A
Alex M DE 42 B
Jodie F OH 18 B
Tom M NY 28 B
Alex F ID 32 B
Nadia F AT 34 A
Tom M OH 18 A
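I want to group the table by Name and Gender, replacing Place and V1 with the majority vote over the grouped rows and Age with the numeric mean. The result should be:
Name Gender Place Age V1
Tom M NY 23.3333 A
Nadia F AT 28 A
Alex M DE 42 B
Jodie F OH 18 B
Alex F ID 32 B
Tom (M) has three entries, with NY appearing twice and OH once. By majority vote, NY is chosen because it occurs more often. The same applies to A in V1. The mean of the ages (24, 28 and 18) is 23.3333.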
I got the numeric mean using dplyr:
dt <- dt %>%
group_by_(.dots=lapply(names(dt)[c(1, 2)], as.symbol)) %>%
summarise_each(funs(mean))
and I can do the majority vote on Place and V1 separately:
dt$place<- dt[, names(which.max(table(place))), by = paste(name, gender)]
dt$V1 <- dt[, names(which.max(table(V1))), by = paste(name, gender)]
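My problem is performance. I have a very large data set, and doing these modifications in multiple steps takes too long. It would be great to do at least the majority vote in a single step with some kind of apply function. Best of all would be to add the majority vote to the dplyr call.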
Answer 0 (score: 5)
We create a vector of the grouping column names ('grpCol') and use setdiff to get the remaining column names ('nm1'). We then loop (sapply) over the 'nm1' columns to check which of them are numeric (is.numeric), which returns a logical index ('indx').
grpCol <- c('Name', 'Gender')
nm1 <- setdiff(names(df1), grpCol)
indx <- sapply(df1[nm1], is.numeric)
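To make the index concrete, here is a small self-contained sketch on a two-row toy data frame. The name toy and its values are made up for illustration and are not part of the original data; only base R is used.

```r
# Hypothetical toy data with the same column layout as the question
toy <- data.frame(Name = c("Tom", "Nadia"), Gender = c("M", "F"),
                  Place = c("NY", "AT"), Age = c(24L, 22L),
                  V1 = c("A", "A"), stringsAsFactors = FALSE)

grpCol <- c("Name", "Gender")
nm1 <- setdiff(names(toy), grpCol)      # "Place" "Age" "V1"
indx <- sapply(toy[nm1], is.numeric)    # named logical vector, one entry per column

names(indx)[indx]     # columns to average: "Age"
names(indx)[!indx]    # columns to take the mode of: "Place" "V1"
```

Because 'indx' is computed once, outside the grouped call, the type check is not repeated for every group.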
We also create a Mode function that returns the element with the highest frequency.
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
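As a quick check of what Mode returns, the sketch below repeats the definition so it runs on its own and applies it to Tom's Place and V1 values from the question, plus a made-up tie case.

```r
Mode <- function(x) {
  ux <- unique(x)                          # distinct values, in order of first appearance
  ux[which.max(tabulate(match(x, ux)))]    # value with the highest count
}

Mode(c("NY", "NY", "OH"))   # "NY"  (Tom's Place values)
Mode(c("A", "B", "A"))      # "A"   (Tom's V1 values)
Mode(c("B", "A"))           # tie: returns "B", the value that appears first
```

On ties this behaves differently from names(which.max(table(x))), because table() sorts the values before counting, so that version would return "A" in the last case. Mode also keeps the type of its input, whereas names() always returns a character string.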
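We convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouped by 'grpCol', subset the Subset of Data.table (.SD) using 'indx'. Looping over each subset with lapply, we return the mean for the numeric columns and the mode for the non-numeric columns, then concatenate (c) the two lists to get the expected output.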
setDT(df1)[,c(lapply(.SD[, names(indx)[indx], with=FALSE], mean),
lapply(.SD[, names(indx)[!indx], with=FALSE], Mode)) ,
by = grpCol]
# Name Gender Age Place V1
#1: Tom M 23.33333 NY A
#2: Nadia F 28.00000 AT A
#3: Alex M 42.00000 DE B
#4: Jodie F 18.00000 OH B
#5: Alex F 32.00000 ID B
Or, as @Frank mentioned in the comments, instead of creating 'indx' we can put an if/else condition inside the lapply.
setDT(df1)[, lapply(.SD, function(x) {if(is.numeric(x)) mean(x)
else Mode(x)} ), by=.(Name,Gender)]
# Name Gender Place Age V1
#1: Tom M NY 23.33333 A
#2: Nadia F AT 28.00000 A
#3: Alex M DE 42.00000 B
#4: Jodie F OH 18.00000 B
#5: Alex F ID 32.00000 B
Data used above:
df1 <- structure(list(Name = c("Tom", "Nadia", "Alex", "Jodie", "Tom",
"Alex", "Nadia", "Tom"), Gender = c("M", "F", "M", "F", "M",
"F", "F", "M"), Place = c("NY", "AT", "DE", "OH", "NY", "ID",
"AT", "OH"), Age = c(24L, 22L, 42L, 18L, 28L, 32L, 34L, 18L),
V1 = c("A", "A", "B", "B", "B", "B", "A", "A")), .Names = c("Name",
"Gender", "Place", "Age", "V1"), class = "data.frame",
row.names = c(NA, -8L))
Answer 1 (score: 1)
Here is a dplyr way:
library(dplyr)
df1 %>%
group_by(Name, Gender) %>%
mutate(Age = mean(Age)) %>%
filter(Place == names(which.max(table(Place))) &
V1 == names(which.max(table(V1)))) %>% unique
# Name Gender Place Age V1
#1 Tom M NY 23.33333 A
#2 Nadia F AT 28.00000 A
#3 Alex M DE 42.00000 B
#4 Jodie F OH 18.00000 B
#5 Alex F ID 32.00000 B
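One thing to watch with the filter approach is that a group disappears entirely when no single row carries both the most frequent Place and the most frequent V1. A possible alternative, sketched here under the assumption that dplyr 1.0.0 or later is installed (for across() and the .groups argument), is to summarise each column directly with the Mode function from the first answer. This is an illustration, not the answerer's code.

```r
library(dplyr)

# Mode as defined in the first answer: most frequent value, first one wins on ties
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

df1 <- data.frame(
  Name   = c("Tom", "Nadia", "Alex", "Jodie", "Tom", "Alex", "Nadia", "Tom"),
  Gender = c("M", "F", "M", "F", "M", "F", "F", "M"),
  Place  = c("NY", "AT", "DE", "OH", "NY", "ID", "AT", "OH"),
  Age    = c(24L, 22L, 42L, 18L, 28L, 32L, 34L, 18L),
  V1     = c("A", "A", "B", "B", "B", "B", "A", "A"),
  stringsAsFactors = FALSE
)

res <- df1 %>%
  group_by(Name, Gender) %>%
  # across(everything()) skips the grouping columns; each remaining column
  # is reduced to its mean if numeric, otherwise to its mode
  summarise(across(everything(),
                   ~ if (is.numeric(.x)) mean(.x) else Mode(.x)),
            .groups = "drop")

res
```

This returns one row per Name/Gender pair in a single pass, so every group is kept and no unique step is needed afterwards.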