Question

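I have a data frame in which each row is a unique user, with one column holding each user's choice and further columns for background variables. In the real dataset there are more than 800 choices, and the background variables can have anywhere from 2 to 60+ levels.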

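What I am doing is finding the total count for each choice under various grouping combinations. That part works, but I now want to name the results in a meaningful way.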

A source that helped with the first part of this problem: How to use aggregate with a list of column names

# sample data
d <- data.frame(user_id = 1:12, choice = rep(c("apple", "banana", "banana"),4), cond1 = c("a1", "a1", "a2", "a1", "a2", "a1", "a2", "a2", "a3", "a3", "a3", "a3"), gender = c(rep(c("F", "M"), 6)))

# list of the different grouping combinations (choice will always be one of these terms)
c1 <- list(c("choice", "cond1"), c("choice", "cond1","gender"))  

# variable to be aggregated over
v1 <- c("user_id")    

varNames <- c("cond1Ct", "cond1_GenderCt")         # for later use

counts <- mapply(FUN= function(z,y, ...) {aggregate(d[ , y], by=d[z], ...)},
       c1, v1, MoreArgs=list(FUN=length) )

# assign each part of the output to its own dataframe for later use
cond1 <- counts[[1]]
cond1_Gender <- counts[[2]]

So this works so far.

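As shown in Name columns within aggregate in R, it is possible to specify a column name for the variable being aggregated. I could not get that to work inside the mapply call. That is my main question: how to automatically name the count variable in the table for each grouping combination.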

My brute-force approach:

# rename the column labeled 'x' in a meaningful way ... should be possible to do using varnames (defined above)

cond1_Gender$cond1_Gender_Ct <- cond1_Gender$x          
cond1$cond1_Ct <- cond1$x
cond1_Gender$x <- NULL
cond1$x <- NULL

cond1_Gender

# bring back the now meaningfully named columns into the original dataframe
dUse <- merge(merge(d, cond1), cond1_Gender)  
     # efficiency of this for large dataset and 8+ aggregated datasets???



dUse_Cond <- dUse[!duplicated(dUse[c("choice", "cond1")]),]   # later: use item from list here!
# has extra columns that just get ignored ... but that could be potentially confusing for later users

# repeat for all other grouping variable combinations

# dUse_CondGender <- dUse[!duplicated(dUse[c1[[2]],])]   # (DOES NOT WORK!)
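
The commented-out line above most likely fails because the closing bracket of the inner subset sits in the wrong place, so the comma ends up inside `dUse[...]` instead of after `duplicated(...)`. A minimal sketch of the de-duplication over every grouping combination follows; it uses `d` rather than `dUse` only to stay self-contained, and `dedup` is a hypothetical name.

```r
# Sample data and grouping combinations, copied from the question
d <- data.frame(user_id = 1:12,
                choice = rep(c("apple", "banana", "banana"), 4),
                cond1 = c("a1", "a1", "a2", "a1", "a2", "a1",
                          "a2", "a2", "a3", "a3", "a3", "a3"),
                gender = rep(c("F", "M"), 6))
c1 <- list(c("choice", "cond1"), c("choice", "cond1", "gender"))

# Keep the first row of each distinct combination of the grouping columns.
# duplicated() is applied to the column subset; the row filter comes after it.
dedup <- lapply(c1, function(grp) d[!duplicated(d[grp]), ])

sapply(dedup, nrow)
```

The same pattern should carry over to `dUse` by replacing `d` inside the function.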

How can I do this naming in a more automated way, so that it works across data files and multiple variables?

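The bigger picture: this is only one part of a larger task in which I want to plot the frequency of the top choice for each grouping combination. I already have code that determines the top-rated occupation within each grouping combination and plots those frequencies for each level of the grouping category.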

Added 3-25-14: clarifying the context in which this arises, although it may be beyond the scope of this question.

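Individuals have each saved one choice (from a long list). I want to know, graphically and numerically, which choices are "most common" (and their corresponding frequencies) for different combinations of background grouping variables: for example, the most common choice for fifth-grade males, the most frequent choice for 10th-grade students in state X, or for all high school students in the treatment condition. I have code that takes me through each individual background grouping: tabulating (which is what this question deals with), finding the "most common" choice, computing statistics on that group's frequencies, and plotting according to the variables involved in that group. Overall, I am trying to make this more general and vectorized, so that it is easier for later users to define the combinations of background variables they want to look at. jlhoward's answer adds a count column for each desired combination to the original data frame.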

Thanks! I hope this makes sense ... I have read many questions and answers here, but have not tried posting before.

Answer 1

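I think this can be simplified a great deal by using the table function (and class) and its as.data.frame method to produce objects suitable for merging: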

counts <- lapply(c1, function(cond) { as.data.frame( table(d[cond]))}) 
# That returns two 'Freq' vectors (named in the `as.data.frame` step) in a list.

d[order(d[2],d[3],d[4]), varNames] <- lapply(counts, function( cts) {
               merge(d[order(d[2],d[3],d[4]), ], cts )[['Freq']] })
#Could also have `cbind`-ed it. The `d[names] <-` assigned the names. 
#Could also have used `setNames` on the RHS. 

#------------

> d
   user_id choice cond1 gender cond1Ct cond1_GenderCt
1        1  apple    a1      F       2              1
2        2 banana    a1      M       2              2
3        3 banana    a2      F       3              2
4        4  apple    a1      M       2              1
5        5 banana    a2      F       3              2
6        6 banana    a1      M       2              2
7        7  apple    a2      F       1              1
8        8 banana    a2      M       3              1
9        9 banana    a3      F       3              2
10      10  apple    a3      M       1              1
11      11 banana    a3      F       3              2
12      12 banana    a3      M       3              1
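
As a possible variation on the approach above, the count column can be named at the moment it is created, because the `as.data.frame` method for tables has a `responseName` argument. The sketch below assumes the sample data from the question; `merged` is a hypothetical name, and `Map` and `Reduce` are swapped in for the ordered assignment, so this is not the code the answer itself uses.

```r
# Sample data, grouping combinations and target names, copied from the question
d <- data.frame(user_id = 1:12,
                choice = rep(c("apple", "banana", "banana"), 4),
                cond1 = c("a1", "a1", "a2", "a1", "a2", "a1",
                          "a2", "a2", "a3", "a3", "a3", "a3"),
                gender = rep(c("F", "M"), 6))
c1 <- list(c("choice", "cond1"), c("choice", "cond1", "gender"))
varNames <- c("cond1Ct", "cond1_GenderCt")

# One long count table per grouping combination; responseName sets the
# name of the count column instead of the default "Freq".
counts <- Map(function(cond, nm) {
  as.data.frame(table(d[cond]), responseName = nm, stringsAsFactors = FALSE)
}, c1, varNames)

# Merge each count table back onto d; merge() joins on the shared grouping columns.
merged <- Reduce(function(acc, cts) merge(acc, cts, all.x = TRUE), counts, d)
merged <- merged[order(merged$user_id), ]
merged
```

Because the names travel with the count tables, no separate renaming step is needed after the merge.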

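I will admit that I tried to get the ave function to deliver the count vectors, but it would not accept a list argument for its indexing arguments. I looked back at an earlier function I had developed and saw that table does accept a list. My second admission is that I had not realized that assigning to ordered positions does not reorder the original object: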

> a <- 10:1
> a[order(a)][2] <-100
> a
 [1]  10   9   8   7   6   5   4   3 100   1  # surprised me anyway.
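
On the ave attempt mentioned above: one way that may sidestep the list problem is to collapse the grouping columns into a single factor with `interaction()`, which is documented to accept a single list of factors, and hand that one factor to `ave`. This is only a sketch on the question's sample data, not something the answer tested.

```r
# Sample data, grouping combinations and target names, copied from the question
d <- data.frame(user_id = 1:12,
                choice = rep(c("apple", "banana", "banana"), 4),
                cond1 = c("a1", "a1", "a2", "a1", "a2", "a1",
                          "a2", "a2", "a3", "a3", "a3", "a3"),
                gender = rep(c("F", "M"), 6))
c1 <- list(c("choice", "cond1"), c("choice", "cond1", "gender"))
varNames <- c("cond1Ct", "cond1_GenderCt")

for (i in seq_along(c1)) {
  # Collapse the grouping columns into one factor, dropping unused combinations
  g <- interaction(d[c1[[i]]], drop = TRUE)
  # ave() returns a vector aligned with the rows of d, so no merge is needed;
  # FUN must be passed by name because it follows the ... argument.
  d[[varNames[i]]] <- ave(d$user_id, g, FUN = length)
}
d
```

Rows with a missing value in any grouping column would need separate handling, since their interaction level is `NA`.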

The as.data.frame method for table objects simply creates a "long" data frame from the table entries, with a Freq column containing the counts:

 as.data.frame( table(d[-(1:3)]) )
#-----------------------
   gender cond1Ct cond1_GenderCt Freq
1       F       1              1    1
2       M       1              1    1
3       F       2              1    1
4       M       2              1    1
5       F       3              1    1
6       M       3              1    1
7       F       1              2    0
8       M       1              2    0
9       F       2              2    1
10      M       2              2    1
11      F       3              2    2
12      M       3              2    2
> table(d[-(1:3)]) 
, , cond1_GenderCt = 1

      cond1Ct
gender 1 2 3
     F 1 1 1
     M 1 1 1

, , cond1_GenderCt = 2

      cond1Ct
gender 1 2 3
     F 0 1 2
     M 0 1 2

Answer 2

Your call to mapply(...) is more complicated than it needs to be:

counts   <- mapply(FUN= function(z,y) aggregate(d[ , y], by=d[z], FUN=length),
                      c1, "user_id")

If you just want to automatically rename the column x in each data frame to something else, you can use:

# rename all "x" columns
for (i in seq_along(counts)) 
  colnames(counts[[i]])[ncol(counts[[i]])] <- varNames[i]
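
The renaming can also be folded into the aggregation itself by passing the target names as a second vectorised argument. The following is a sketch under the question's sample data; it swaps `Map` in for `mapply` so the result is always a list (`mapply` may simplify to a matrix when every result has the same number of columns), and it subsets with `d["user_id"]` so the count column starts out named `user_id` rather than `x`.

```r
# Sample data, grouping combinations and target names, copied from the question
d <- data.frame(user_id = 1:12,
                choice = rep(c("apple", "banana", "banana"), 4),
                cond1 = c("a1", "a1", "a2", "a1", "a2", "a1",
                          "a2", "a2", "a3", "a3", "a3", "a3"),
                gender = rep(c("F", "M"), 6))
c1 <- list(c("choice", "cond1"), c("choice", "cond1", "gender"))
varNames <- c("cond1Ct", "cond1_GenderCt")

counts <- Map(function(grp, nm) {
  # Count users per combination of the grouping columns
  out <- aggregate(d["user_id"], by = d[grp], FUN = length)
  # Rename the count column to the name supplied for this combination
  names(out)[names(out) == "user_id"] <- nm
  out
}, c1, varNames)

# Label the list elements as well, so each table can be fetched by name
names(counts) <- varNames
counts
```

With named list elements, `counts[["cond1Ct"]]` retrieves a table without relying on its position.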

To address your core problem, I would need to see an example of a "plot of the frequency of the top choice for each grouping combination".

Edit (in response to the OP's comment)

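If your intermediate goal is to combine everything into a single data frame, there is a simpler way. Note that this leaves the renaming of the aggregated column x until the end.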

counts   <- mapply(FUN= function(z,y) aggregate(d[ , y], by=d[z], FUN=length),
                   c1, "user_id")
mrg <- lapply(counts,function(df)merge(d,df)[,c("user_id","x")])
mrg <- do.call(cbind,lapply(mrg,function(df)merge(d,df,by="user_id")$x))
colnames(mrg) <- varNames
result <- cbind(d,mrg)
result
#    user_id choice cond1 gender cond1Ct cond1_GenderCt
# 1        1  apple    a1      F       2              1
# 2        2 banana    a1      M       2              2
# 3        3 banana    a2      F       3              2
# 4        4  apple    a1      M       2              1
# 5        5 banana    a2      F       3              2
# 6        6 banana    a1      M       2              2
# 7        7  apple    a2      F       1              1
# 8        8 banana    a2      M       3              1
# 9        9 banana    a3      F       3              2
# 10      10  apple    a3      M       1              1
# 11      11 banana    a3      F       3              2
# 12      12 banana    a3      M       3              1

The first use of lapply(...)

mrg <- lapply(counts,function(df)merge(d,df)[,c("user_id","x")])

creates a list of data frames, each of which associates user_id with the count for the appropriate combination. Then,

mrg <- do.call(cbind,lapply(mrg,function(df)merge(d,df,by="user_id")$x))

combines the x columns into a single matrix, with the rows correctly ordered by user_id. Finally,

result <- cbind(d,mrg)

binds those columns to the original data frame d, which is already in user_id order.

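Again, it would be much better to understand your ultimate goal, because it is almost certain that it can be achieved without going through all of this.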
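
If the ultimate goal is the most common choice within each combination of background variables, as the question's added context suggests, a more direct route might look like the sketch below. The helper `top_choice` is hypothetical, it assumes the choice column is literally named `choice`, and ties are broken arbitrarily by keeping whichever row sorts first.

```r
# Sample data, copied from the question
d <- data.frame(user_id = 1:12,
                choice = rep(c("apple", "banana", "banana"), 4),
                cond1 = c("a1", "a1", "a2", "a1", "a2", "a1",
                          "a2", "a2", "a3", "a3", "a3", "a3"),
                gender = rep(c("F", "M"), 6))

# Hypothetical helper: most frequent choice within each level combination of by_vars
top_choice <- function(df, by_vars) {
  # Long table of counts for choice crossed with the background variables
  cts <- as.data.frame(table(df[c("choice", by_vars)]), stringsAsFactors = FALSE)
  cts <- cts[cts$Freq > 0, ]            # drop combinations that never occur
  cts <- cts[order(-cts$Freq), ]        # largest counts first
  cts[!duplicated(cts[by_vars]), ]      # first row per group is the top choice
}

top_choice(d, "cond1")
top_choice(d, c("cond1", "gender"))
```

The returned data frame keeps the `Freq` column, which could then feed a plot of top-choice frequencies per group.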

Answer 3

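Using the plyr package seems to simplify the code a great deal, and it handles grouping variables with missing values and cases where one id has more than one choice (both of which came up when I took this back to the larger dataset).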

library (plyr)
d2 <- data.frame(user_id = 1:12, choice = rep(c("apple", "banana", "banana"),4), 
                 cond1 = c("a1", "a1", "a2", "a1", "a2", "a1", "a2", "a2", "a3", "a3", "a3", "a3"), 
                 gender = c(rep(c("F", "M"), 6)))

d2$user_id[7] <- 5         # modify the dataset some
d2$gender[10] <- NA

tmp1 <- ddply(d2, ~cond1 + gender + choice, summarize, cond1_GenderCt = length(choice))     
tmp2 <- ddply(d2, ~cond1 + choice, summarize, cond1Ct = length(choice))     
result2 <- merge (tmp2, merge(tmp1, d2))
result2

This creates a data frame with named variables that bring back the frequency of each choice within each set of grouping variables.

Edit: So I apparently forgot the whole point of my own question! Handling different combinations of variables.

doddply <- function(df, x){
  ddply(df,x,summarize,nChoice = length(choice))
}

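# c2 is not defined in this post; presumably a list of grouping combinations like c1 in the question
lapply (c2, function (x) {doddply(d2, x)})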

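It seems that a variant of the doddply function above, one that takes the list of varNames as the source for "nChoice" and is called by mapply, would be helpful, but I could not get it to work.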
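
One possible shape for such a variant is sketched here. The names `doddply2`, `ctName` and `counts2` are hypothetical, and the sketch relies on two plyr behaviours: `ddply` accepts a character vector of grouping variables, and it accepts a function that returns a data frame for each group. Because `summarize` takes its output names from the argument names, the sketch replaces it with an anonymous function that builds a one-column data frame and names it with `setNames`.

```r
library(plyr)

# Sample data, grouping combinations and target names, copied from the question
d2 <- data.frame(user_id = 1:12,
                 choice = rep(c("apple", "banana", "banana"), 4),
                 cond1 = c("a1", "a1", "a2", "a1", "a2", "a1",
                           "a2", "a2", "a3", "a3", "a3", "a3"),
                 gender = rep(c("F", "M"), 6))
c1 <- list(c("choice", "cond1"), c("choice", "cond1", "gender"))
varNames <- c("cond1Ct", "cond1_GenderCt")

# Hypothetical variant of doddply: ctName is the name to give the count column
doddply2 <- function(df, x, ctName) {
  # For each group, return a one-row data frame whose single column is the
  # group size, named ctName; ddply adds the grouping columns itself.
  ddply(df, x, function(g) setNames(data.frame(nrow(g)), ctName))
}

# Map pairs each grouping combination with its column name and returns a list
counts2 <- Map(function(x, nm) doddply2(d2, x, nm), c1, varNames)
counts2
```

The resulting list can then be merged back onto d2 in the same way as the tmp1 and tmp2 tables above.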

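So this ultimately ends up at exactly the same point as @jlhoward's solution ... the code that follows the counts variable is still needed for naming and merging. (I am leaving this here simply as another way to get to that point.)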

mapply over aggregate with named output columns in R
