Question

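I was wondering whether there is a more direct way to calculate a certain type of variable than the approach I normally take.

The example below probably explains it best. I have a data frame with 2 columns (the fruit, and whether that fruit is rotten). For each row, I would like to add, for example, the percentage of fruit of the same category that is rotten. There are 4 entries for apples, 2 of which are rotten, so each apple row should read 0.5. The target values (included purely for illustration) are in the "desired outcome" column.

I have previously solved this problem by:
* using the "ddply" command on the Fruit variable (with sum / length as the function), creating a new 3 * 2 data frame
* using the "merge" command to link these values back into the old data frame.

This feels like a roundabout way, and I was wondering whether there is a better/faster way of doing this! Ideally it would be a generic approach that is easy to adapt if, instead of a percentage, one needs to determine whether all fruits are rotten, whether any fruit is rotten, and so on.

Many thanks in advance,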

W

    Fruit Rotten Desired_Outcome_PercRotten
1   Apple      1                        0.5
2   Apple      1                        0.5
3   Apple      0                        0.5
4   Apple      0                        0.5
5    Pear      1                       0.75
6    Pear      1                       0.75
7    Pear      1                       0.75
8    Pear      0                       0.75
9  Cherry      0                          0
10 Cherry      0                          0
11 Cherry      0                          0

# create example data frame; the desired-outcome column is included purely to illustrate the target values
Fruit=c(rep("Apple",4),rep("Pear",4),rep("Cherry",3))
Rotten=c(1,1,0,0,1,1,1,0,0,0,0)
Desired_Outcome_PercRotten=c(0.5,0.5,0.5,0.5,0.75,0.75,0.75,0.75,0,0,0)
df=as.data.frame(cbind(Fruit,Rotten,Desired_Outcome_PercRotten))        
df

Answer 1

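You can do this using just ddply and mutate:

library(plyr)

# changed summarise to transform on joran's suggestion
# changed transform to mutate on mnel's suggestion :)
# note: this assumes Rotten is numeric (e.g. df built with data.frame(Fruit, Rotten))
ddply(df, .(Fruit), mutate, Perc = sum(Rotten)/length(Rotten))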

#     Fruit Rotten Perc
# 1   Apple      1 0.50
# 2   Apple      1 0.50
# 3   Apple      0 0.50
# 4   Apple      0 0.50
# 5  Cherry      0 0.00
# 6  Cherry      0 0.00
# 7  Cherry      0 0.00
# 8    Pear      1 0.75
# 9    Pear      1 0.75
# 10   Pear      1 0.75
# 11   Pear      0 0.75

Answer 2

data.table is very fast because it updates by reference. How about using it?

library(data.table)

dt=data.table(Fruit,Rotten,Desired_Outcome_PercRotten)

dt[,test:=sum(Rotten)/.N,by="Fruit"]
#dt
#     Fruit Rotten Desired_Outcome_PercRotten test
# 1:  Apple      1                       0.50 0.50
# 2:  Apple      1                       0.50 0.50
# 3:  Apple      0                       0.50 0.50
# 4:  Apple      0                       0.50 0.50
# 5:   Pear      1                       0.75 0.75
# 6:   Pear      1                       0.75 0.75
# 7:   Pear      1                       0.75 0.75
# 8:   Pear      0                       0.75 0.75
# 9: Cherry      0                       0.00 0.00
#10: Cherry      0                       0.00 0.00
#11: Cherry      0                       0.00 0.00
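
The same pattern extends to the "any rotten" / "all rotten" variants mentioned in the question. The following is a minimal sketch, assuming a freshly built table named `toy` (a hypothetical name, not from the original post) with a numeric `Rotten` column; the new column names are likewise made up for illustration.

```r
library(data.table)

# Hypothetical toy data mirroring the question's example
toy <- data.table(
  Fruit  = c(rep("Apple", 4), rep("Pear", 4), rep("Cherry", 3)),
  Rotten = c(1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0)
)

# Add several group-level columns in one call; := modifies toy in place,
# so no assignment back to toy is needed
toy[, `:=`(
  PercRotten = mean(Rotten),       # share of rotten fruit per group
  AnyRotten  = any(Rotten == 1),   # TRUE if at least one is rotten
  AllRotten  = all(Rotten == 1)    # TRUE only if every one is rotten
), by = Fruit]

toy
```

Each expression is evaluated once per group and recycled to every row of that group, so any function returning a single value can be swapped in.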

Answer 3

One solution in base R is to use ave.

within(df, {
  ## Because of how you've created your data.frame
  ##   Rotten is actually a factor. So, we need to
  ##   convert it to numeric before we can use mean
  Rotten <- as.numeric(as.character(Rotten))
  NewCol <- ave(Rotten, Fruit)
})
    Fruit Rotten Desired_Outcome_PercRotten NewCol
1   Apple      1                        0.5   0.50
2   Apple      1                        0.5   0.50
3   Apple      0                        0.5   0.50
4   Apple      0                        0.5   0.50
5    Pear      1                       0.75   0.75
6    Pear      1                       0.75   0.75
7    Pear      1                       0.75   0.75
8    Pear      0                       0.75   0.75
9  Cherry      0                          0   0.00
10 Cherry      0                          0   0.00
11 Cherry      0                          0   0.00

Or, more concisely:

transform(df, desired = ave(Rotten == 1, Fruit))

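The default function applied by ave is mean, which is why I did not include it here. If you want to do something different, you can specify another function by appending FUN = some-function-here.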
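
As an illustration of the FUN argument, here is a minimal sketch of the "any rotten" and "all rotten" variants from the question. It assumes a freshly built data frame named `toy` (a hypothetical name) in which `Rotten` is already numeric; the new column names are made up as well.

```r
# Hypothetical toy data mirroring the question's example
toy <- data.frame(
  Fruit  = c(rep("Apple", 4), rep("Pear", 4), rep("Cherry", 3)),
  Rotten = c(1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0)
)

# FUN must be passed by name, because unnamed arguments after the
# first are treated as grouping variables
toy$AnyRotten <- ave(toy$Rotten, toy$Fruit, FUN = function(x) any(x == 1))
toy$AllRotten <- ave(toy$Rotten, toy$Fruit, FUN = function(x) all(x == 1))

# ave keeps the type of its first argument, so the logical results
# come back as numeric 1/0 rather than TRUE/FALSE
toy
```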

Answer 4

Since ave has already been covered, let me add a solution using my base R function of choice: aggregate.

You can get the per-group values with:

aggregate(as.numeric(as.character(Rotten)) ~ Fruit, df, mean)

However, you still need to merge afterwards (or do it all in one step):

merge(df, aggregate(as.numeric(as.character(Rotten)) ~ Fruit, df, mean))
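
With the call above, the aggregated column ends up with the unwieldy name `as.numeric(as.character(Rotten))`. The sketch below shows one way to get a readable name, assuming a freshly built data frame named `toy` (hypothetical) where `Rotten` is already numeric; `PercRotten` is an illustrative column name, not one from the original post.

```r
# Hypothetical toy data mirroring the question's example
toy <- data.frame(
  Fruit  = c(rep("Apple", 4), rep("Pear", 4), rep("Cherry", 3)),
  Rotten = c(1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0)
)

# Step 1: one row per fruit with the mean of Rotten
per_fruit <- aggregate(Rotten ~ Fruit, data = toy, FUN = mean)
names(per_fruit)[names(per_fruit) == "Rotten"] <- "PercRotten"

# Step 2: join the group values back onto every row;
# merge sorts the result by the key column (Fruit)
result <- merge(toy, per_fruit, by = "Fruit")
result
```

Note that merge reorders the rows by `Fruit`, so the output no longer follows the original row order.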

Calculating "group characteristics" without ddply and merge
