Question

I have data like this:

object category country
495647 1        RUS  
477462 2        GER  
431567 3        USA  
449136 1        RUS  
367260 1        USA  
495649 1        RUS  
477461 2        GER  
431562 3        USA  
449133 2        RUS  
367264 2        USA  
...

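Here an object appears in various (category, country) pairs, and the countries share a single list of categories.

I'd like to add another column: a category weight per country, i.e. the number of objects that appear in a category within a country, normalized to sum to 1 within that country (summing only over unique (category, country) pairs).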

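I could do something like:

aggregate(df$object, list(df$category, df$country), length)

and then calculate the weight from there, but what is a more efficient and elegant way of doing this directly on the original data?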
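For reference, the aggregate() route could be completed roughly as follows. This is only a sketch: the names n_pair, n_country, pair_counts, country_counts and result are made up for illustration, and the sample rows use the category values that appear in the outputs shown in this thread.

```r
# Sample rows (category values follow the outputs shown in this thread).
df <- data.frame(
  object   = c(495647, 477462, 431567, 449136, 367260,
               495649, 477461, 431562, 449133, 367264),
  category = c(1, 2, 3, 1, 1, 1, 3, 3, 2, 2),
  country  = c("RUS", "GER", "USA", "RUS", "USA",
               "RUS", "GER", "USA", "RUS", "USA"),
  stringsAsFactors = FALSE
)

# Number of objects per (category, country) pair.
pair_counts <- aggregate(object ~ category + country, data = df, FUN = length)
names(pair_counts)[3] <- "n_pair"

# Number of objects per country, used as the normalizing total.
country_counts <- aggregate(object ~ country, data = df, FUN = length)
names(country_counts)[2] <- "n_country"

# Attach the country total to every pair and derive the weight.
pair_counts <- merge(pair_counts, country_counts, by = "country")
pair_counts$weight <- pair_counts$n_pair / pair_counts$n_country

# Join the weight back onto the rows; merge() does not keep the original row order.
result <- merge(df, pair_counts[, c("category", "country", "weight")],
                by = c("category", "country"))
```

This needs two aggregations and two merges, which is the overhead the question is hoping to avoid.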

Desired example output:

object category country weight
495647 1        RUS     .75
477462 2        GER     .5
431567 3        USA     .5
449136 1        RUS     .75
367260 1        USA     .25
495649 1        RUS     .75
477461 3        GER     .5
431562 3        USA     .5
449133 2        RUS     .25
367264 2        USA     .25
...

For unique (category, country) pairs, the weights above sum to one within each country.

Answer 1

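Responding specifically to the last sentence: "what is a more efficient and elegant way of doing this directly on the original data?" It just so happens that data.table has a new feature for exactly this.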

install.packages("data.table", repos="http://R-Forge.R-project.org")
# Needs version 1.8.1 from R-Forge.  Soon to be released to CRAN.

With your data in DT:

> DT[, countcat:=.N, by=list(country,category)]     # add 'countcat' column
    category country countcat
 1:        1     RUS        3
 2:        2     GER        1
 3:        3     USA        2
 4:        1     RUS        3
 5:        1     USA        1
 6:        1     RUS        3
 7:        3     GER        1
 8:        3     USA        2
 9:        2     RUS        1
10:        2     USA        1

> DT[, weight:=countcat/.N, by=country]     # add 'weight' column
    category country countcat weight
 1:        1     RUS        3   0.75
 2:        2     GER        1   0.50
 3:        3     USA        2   0.50
 4:        1     RUS        3   0.75
 5:        1     USA        1   0.25
 6:        1     RUS        3   0.75
 7:        3     GER        1   0.50
 8:        3     USA        2   0.50
 9:        2     RUS        1   0.25
10:        2     USA        1   0.25

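:= adds a column to the data by reference and is an "old" feature. What is new is that it now works by group. .N is a symbol that holds the number of rows in each group.

These operations are memory efficient and should scale to large data, e.g. 1e8 or 1e9 rows.

If you don't want to keep the intermediate column countcat, just remove it afterwards. Again, this is an efficient operation that completes instantly regardless of the table's size (it works by moving pointers internally).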

> DT[,countcat:=NULL]     # remove 'countcat' column
    category country weight
 1:        1     RUS   0.75
 2:        2     GER   0.50
 3:        3     USA   0.50
 4:        1     RUS   0.75
 5:        1     USA   0.25
 6:        1     RUS   0.75
 7:        3     GER   0.50
 8:        3     USA   0.50
 9:        2     RUS   0.25
10:        2     USA   0.25
>
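
The steps in this answer can be gathered into one self-contained script. This is a minimal sketch, assuming an installed data.table that supports := by group (the answer cites version 1.8.1); the sample rows use the category values from the printed output above.

```r
library(data.table)

# Sample rows (category values follow the printed output in this answer).
DT <- data.table(
  object   = c(495647, 477462, 431567, 449136, 367260,
               495649, 477461, 431562, 449133, 367264),
  category = c(1, 2, 3, 1, 1, 1, 3, 3, 2, 2),
  country  = c("RUS", "GER", "USA", "RUS", "USA",
               "RUS", "GER", "USA", "RUS", "USA")
)

# Rows per (country, category) pair, added by reference.
DT[, countcat := .N, by = list(country, category)]

# Divide by the number of rows in the country (.N within each country group).
DT[, weight := countcat / .N, by = country]

# Drop the intermediate column.
DT[, countcat := NULL]

print(DT)
```

Unlike the transcript above, this keeps the object column, so the result lines up with the desired output in the question.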

Answer 2

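I actually asked a similar question some time ago. data.table is really nice for this, especially now that := is implemented by group and the self-join is no longer required, as shown above. The best base R solution is ave(). tapply() could also be used.

This is similar to the solution above, but uses ave(). However, I strongly recommend that you look into data.table.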

df$count <- ave(x = df$object, df$country, df$category, FUN = length)
df$weight <- ave(x = df$count, df$country, FUN = function(x) x/length(x))
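
The tapply() alternative mentioned above is not spelled out in this answer; one way it might look is sketched below. The names n_pair, n_country and idx are illustrative, and the sample rows use the category values from the outputs shown in this thread.

```r
# Sample rows (category values follow the outputs shown in this thread).
df <- data.frame(
  object   = c(495647, 477462, 431567, 449136, 367260,
               495649, 477461, 431562, 449133, 367264),
  category = c(1, 2, 3, 1, 1, 1, 3, 3, 2, 2),
  country  = c("RUS", "GER", "USA", "RUS", "USA",
               "RUS", "GER", "USA", "RUS", "USA"),
  stringsAsFactors = FALSE
)

# category x country matrix of counts (NA where a pair never occurs).
n_pair <- tapply(df$object, list(df$category, df$country), length)

# Number of rows per country.
n_country <- tapply(df$object, df$country, length)

# Look up each row's pair count and country total by name, then divide.
idx <- cbind(as.character(df$category), as.character(df$country))
df$weight <- n_pair[idx] / as.vector(n_country[as.character(df$country)])
```

Compared with ave(), this needs an explicit lookup step to map the grouped counts back onto the rows.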

Answer 3

I don't see a readable way to do this in one line, but it can be made quite compact.

# Use table to get the counts.
counts <- table(df[,2:3])
# Normalize the table
weights <- t(t(counts)/colSums(counts))
# Use 'matrix' selection by names.
df$weight <- weights[as.matrix(df[,2:3])]
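
A variant of the same idea that refers to columns by name and builds the lookup index explicitly is sketched below. It swaps in prop.table() for the transpose-and-divide step, and building the index with as.character() avoids depending on how as.matrix() converts a data frame with mixed column types to character. The name idx is illustrative, and the sample rows use the category values from the outputs shown in this thread.

```r
# Sample rows (category values follow the outputs shown in this thread).
df <- data.frame(
  object   = c(495647, 477462, 431567, 449136, 367260,
               495649, 477461, 431562, 449133, 367264),
  category = c(1, 2, 3, 1, 1, 1, 3, 3, 2, 2),
  country  = c("RUS", "GER", "USA", "RUS", "USA",
               "RUS", "GER", "USA", "RUS", "USA"),
  stringsAsFactors = FALSE
)

# category x country contingency table of counts.
counts <- table(category = df$category, country = df$country)

# Normalize each column (country) so that it sums to 1.
weights <- prop.table(counts, margin = 2)

# Two-column character matrix: one (category, country) name pair per row.
idx <- cbind(as.character(df$category), as.character(df$country))
df$weight <- as.vector(weights[idx])
```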

Applying a calculation per group within an R data frame

3 answers: