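Dplyr filter based on two variables

Time: 2015-12-18 23:40:45

Tags: r dplyr tidyr spread

I would like to use dplyr to determine which observations in a data frame satisfy the following condition: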

  • Within each Group, the sum of the Var2 observations where Var1 == good is greater than the sum of the Var2 observations where Var1 == bad

Here is the toy data frame:

library(dplyr)
set.seed(seed = 10)
df <- data.frame("Id" = 1:12,
                 "Group" = paste(sapply(toupper(letters[1:3]), rep, times = 4, simplify = T)),
                 "Var1" = sample(rep(c("good","bad"), times = 1000), size = 12),
                 "Var2" = sample(rep(1:10, times = 1000), size = 12))
print(df)
   Id Group Var1 Var2
1   1     A good    6
2   2     A  bad    9
3   3     A good   10
4   4     A good    7
5   5     B  bad    9
6   6     B  bad    1
7   7     B  bad    6
8   8     B good    6
9   9     C good    1
10 10     C  bad    8
11 11     C good    4
12 12     C  bad    2

So far I have worked out that I should use some combination of group_by(), summarise() and filter(), but I can't seem to wrap my head around a good way to do this. Here is what I have come up with so far:

keepers <- df %>%
  group_by(Group, Var1) %>%
  summarise(Total = sum(Var2)) %>%
  print()
Source: local data frame [6 x 3]
Groups: Group [?]

  Group  Var1 Total
  (chr) (chr) (int)
1     A   bad     9
2     A  good    23
3     B   bad    16
4     B  good     6
5     C   bad    10
6     C  good     5

What next steps should I take? In the end, the analysis should return "A", because it is the only Group whose Total for the good observations is greater than its Total for the bad observations.

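2 Answers:

Answer 0: (score: 3)

How about using spread() from tidyr before filter()? Spreading Var1 puts the bad and good totals in separate columns, so they can be compared row by row: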

> library(tidyr)
> df %>% group_by(Group, Var1) %>%
+    summarise(Total = sum(Var2)) %>%
+    spread(Var1,Total) %>%
+    filter(good>bad)
Source: local data frame [1 x 3]

  Group bad good
1     A   9   23
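
In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). The following is a minimal sketch of the same pipeline under that assumption, not part of the original answer. The data frame is typed in from the printed output in the question rather than regenerated with set.seed(), because sample() may return different values in newer versions of R. The .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

# Toy data copied by hand from the data frame printed in the question
df <- data.frame(
  Id    = 1:12,
  Group = rep(c("A", "B", "C"), each = 4),
  Var1  = c("good", "bad", "good", "good",
            "bad",  "bad", "bad",  "good",
            "good", "bad", "good", "bad"),
  Var2  = c(6, 9, 10, 7, 9, 1, 6, 6, 1, 8, 4, 2)
)

result <- df %>%
  group_by(Group, Var1) %>%
  # One row per Group/Var1 pair holding the sum of Var2
  summarise(Total = sum(Var2), .groups = "drop") %>%
  # Long to wide: one column per level of Var1 ("bad", "good")
  pivot_wider(names_from = Var1, values_from = Total) %>%
  # Keep only the groups where the good total exceeds the bad total
  filter(good > bad)

print(result)
```

With the toy data this should leave only Group A (bad = 9, good = 23), matching the output above.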

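Answer 1: (score: 2)

A similar option with data.table. We convert the 'data.frame' to a 'data.table' (setDT(df)), group by 'Group' and 'Var1', get the sum of 'Var2', reshape from 'long' to 'wide' with dcast, and filter the rows where 'good' is greater than 'bad'.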

library(data.table)
dcast(setDT(df)[, sum(Var2) , by = .(Group, Var1)], 
               Group~Var1, value.var='V1')[good>bad]
#   Group bad good
#1:     A   9   23
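
The reshaping step can also be skipped by computing both totals per group in a single aggregation. The following is a hedged sketch of that variant, not the answerer's code. The table is typed in from the printed data in the question, and it is built directly as a data.table because setDT() converts its argument by reference.

```r
library(data.table)

# Toy data copied by hand from the data frame printed in the question
dt <- data.table(
  Group = rep(c("A", "B", "C"), each = 4),
  Var1  = c("good", "bad", "good", "good",
            "bad",  "bad", "bad",  "good",
            "good", "bad", "good", "bad"),
  Var2  = c(6, 9, 10, 7, 9, 1, 6, 6, 1, 8, 4, 2)
)

# Sum Var2 separately for the good and bad rows of each Group,
# then chain a second [ ] to keep the groups where good exceeds bad
res <- dt[, .(good = sum(Var2[Var1 == "good"]),
              bad  = sum(Var2[Var1 == "bad"])),
          by = Group][good > bad]

print(res)
```

This trades the general long-to-wide reshape for two hard-coded level names, which is reasonable only while Var1 has exactly the levels "good" and "bad".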