Spreading duplicated rows with an ID and an outcome variable

Time: 2018-01-25 22:00:46

Tags: r dplyr tidyr

Thanks for your help.

My question is closely related to this thread.

Consider this df:

df <- data.frame(id = c(1,1,2,3,4), fruit =  c("apple","pear","apple","orange","apple"))

We can spread it into 'dummy variables' like this:

df %>% mutate(i = 1) %>% spread(fruit, i, fill = 0) 

Now notice what happens when a duplicated fruit is added.

df2 <- data.frame(id = c(1,1,2,3,4,4), fruit =  c("apple","pear","apple","orange","apple","apple"))

Spreading again

df2 %>% mutate(i = 1) %>% spread(fruit, i, fill = 0)

gives Error: Duplicate identifiers for rows (5, 6)

Ideally, the correct result would contain two fields named apple_1 and apple_2, both set to 1 for id = 4.
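
The error arises because rows 5 and 6 of df2 carry the same id and the same fruit, so spread has two values competing for a single cell. As a minimal sketch (assuming dplyr is installed; the name dupes is only illustrative), the offending combinations can be listed before reshaping:

```r
library(dplyr)

df2 <- data.frame(id = c(1, 1, 2, 3, 4, 4),
                  fruit = c("apple", "pear", "apple", "orange", "apple", "apple"),
                  stringsAsFactors = FALSE)

# Count rows per (id, fruit) pair and keep only the pairs that occur more than once
dupes <- df2 %>%
  count(id, fruit) %>%
  filter(n > 1)

dupes
```

Any pair listed in dupes has to be made unique, or aggregated, before spread can place it in a single cell.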

1 answer:

Answer 0: (score: 0)

You are looking for something like this:

library(reshape2)
library(dplyr)  # provides %>%, group_by and mutate used in the later snippets
df2 <- data.frame(id = c(1,1,2,3,4,4), fruit =  c("apple","pear","apple","orange","apple","apple"), stringsAsFactors = FALSE)
    > dcast(df2, id ~ fruit, value.var = 'fruit', fun.aggregate = list )
      id        apple orange pear
    1  1        apple        pear
    2  2        apple            
    3  3              orange     
    4  4 apple, apple 

Another option might be:

> df2 %>%
  group_by(id) %>%
  mutate(fruit = paste(fruit, row_number(), sep = "_")) %>%
  dcast( id ~ fruit, value.var = "fruit", fun.aggregate = list )

  id apple_1 apple_2 orange_1 pear_2
1  1 apple_1                  pear_2
2  2 apple_1                        
3  3                 orange_1       
4  4 apple_1 apple_2 

If 0/1 values are preferred in each column, then:

> df2 %>%
  group_by(id) %>%
  mutate(fruit = paste(fruit, row_number(), sep = "_")) %>%
  dcast( id ~ fruit, fill = 0 , fun.aggregate = function(x) 1 )
  id apple_1 apple_2 orange_1 pear_2
1  1       1       0        0      1
2  2       1       0        0      0
3  3       0       0        1      0
4  4       1       1        0      0
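
For readers who would rather stay within dplyr and tidyr, as in the question, the following is a sketch and not part of the original answer. It assumes the occurrence counter should restart for each id and fruit pair, so the pear column comes out as pear_1 instead of the pear_2 shown above. The names key and wide are illustrative.

```r
library(dplyr)
library(tidyr)

df2 <- data.frame(id = c(1, 1, 2, 3, 4, 4),
                  fruit = c("apple", "pear", "apple", "orange", "apple", "apple"),
                  stringsAsFactors = FALSE)

wide <- df2 %>%
  group_by(id, fruit) %>%
  # Number repeated fruits within each id so every (id, key) pair is unique
  mutate(key = paste(fruit, row_number(), sep = "_"), i = 1) %>%
  ungroup() %>%
  select(id, key, i) %>%
  # With unique keys, spread no longer hits duplicate identifiers
  spread(key, i, fill = 0)

wide
```

Grouping by id alone, as the answer does, numbers rows by their position within the id, which is why pear ends up with the suffix 2 there. Which numbering is preferable depends on how the columns are meant to be read.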