Filter within groups for the row where x first exceeds y

Asked: 2016-08-08 20:24:20

Tags: r data.table dplyr

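I have a database that consists of:

  • the ID of the home neighborhood (id_h),
  • the ID of the home block (blk_h), a sub-geography of the neighborhood,
  • the work block (blk_w),
  • the flow of commuters between the two (Flow),
  • the median commuter for each home neighborhood (Med_C), and
  • the cumulative flow of workers from the home neighborhood (CumFlow).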

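The data is sorted by the distance between blk_h and blk_w (descending) and grouped by id_h. I need to subset the data so that, for each home neighborhood, I extract the one row where CumFlow FIRST equals or exceeds Med_C.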

I have tried various dplyr functions and cannot get it to work. Here is an example:

df <- data.frame(
  id_h=c("A","A","A","A","B","B","B"),
  blk_h=c("A1","A1","A2","A2","B1","B2","B2"),
  blk_w=c("W1","W2","W3","W3","W1","W2","W2"),
  dist=c(4.3,5.6,7.0,8.7,5.2,6.5,6.8),
  Flow=c(3,6,3,7,5,4,2),
  CumFlow=c(3,9,12,19,5,9,11),
  Med_C=c(10,10,10,10,6,6,6)
)
df

I need to get back a table like this:

id_h  blk_h  blk_w  dist  Flow  CumFlow  Med_C
A     A2     W3     7.0   3     12       10
B     B2     W2     6.5   4     9        6

Here are a few things I have tried:

Attempt #1

library(dplyr)
df.g <- group_by(df, id_h) 
df.g2 <- filter(df.g, CumFlow == which.min(CumFlow >= Med_C))

Attempt #2

library(data.table)
setDT(df)[, .SD[which.min(CumCount >= Med_C)], by = id_h]

Attempt #3

library(dplyr)
test <- df %>% group_by(id_h) %>% filter(min(CumFlow) >= Med_C)

I think I am misunderstanding how to use the which.min function. Any advice is greatly appreciated.

4 answers:

Answer 0 (score: 3)

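Two things:

  • You need slice (which takes an index) rather than filter (which takes a logical vector),
  • which.min is the wrong tool here: it returns the index of the first value equal to the minimum, and the comparison yields only 1s and 0s. You actually want which.max, because you want the first 1, i.e. the first TRUE.

So: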

df %>% group_by(id_h) %>% 
  slice(which.max(CumFlow >= Med_C))
## Source: local data frame [2 x 7]
## Groups: id_h [2]
## 
##     id_h  blk_h  blk_w  dist  Flow CumFlow Med_C
##   <fctr> <fctr> <fctr> <dbl> <dbl>   <dbl> <dbl>
## 1      A     A2     W3   7.0     3      12    10
## 2      B     B2     W2   6.5     4       9     6
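
To see why which.min misfires on a logical vector, here is a minimal base R sketch using group A from the question. The variable names (cum_flow, med_c, hit and so on) are made up for illustration. It also shows a caveat that applies to the slice(which.max(...)) approach: if a group has no TRUE at all, which.max still returns 1.

```r
# Logical test for group A from the example: CumFlow >= Med_C
cum_flow <- c(3, 9, 12, 19)
med_c    <- c(10, 10, 10, 10)
hit <- cum_flow >= med_c        # FALSE FALSE TRUE TRUE

first_false <- which.min(hit)   # index of the first FALSE (the minimum): 1
first_true  <- which.max(hit)   # index of the first TRUE (the maximum): 3

# Caveat: with no TRUE in the vector, which.max() still returns 1,
# whereas which(...)[1] returns NA
none <- c(FALSE, FALSE)
fallback_max   <- which.max(none)   # 1
fallback_which <- which(none)[1]    # NA
```

If some neighborhoods might never reach Med_C, which(CumFlow >= Med_C)[1] may be the safer index to pass on, since slice() should then drop the NA rather than return the group's first row.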

Answer 1 (score: 2)

You can use dplyr like this:

df %>% group_by(id_h) %>% 
  mutate(times_greater = cumsum(CumFlow >= Med_C)) %>% 
  filter(times_greater == 1)
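
One thing to keep in mind with the cumsum counter: it stays at 1 until a second TRUE appears. That is harmless here, because CumFlow never decreases and Med_C is constant within a group. On data where the comparison can turn FALSE again, the later FALSE rows would slip through. The sketch below uses a made-up data frame (toy, with hypothetical columns g, x and y) and also filters on the condition itself.

```r
library(dplyr)

# Hypothetical data (not from the question): x dips below y after first reaching it
toy <- data.frame(
  g = c("A", "A", "A", "A"),
  x = c(1, 5, 2, 6),
  y = c(4, 4, 4, 4)
)

res <- toy %>%
  group_by(g) %>%
  mutate(hit = x >= y, times_greater = cumsum(hit)) %>%
  # keep a row only if the condition holds AND it is the first time it holds
  filter(hit, times_greater == 1) %>%
  ungroup()
```

Without the extra hit condition, the third row (x = 2) would also be returned, because its running count is still 1.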

Answer 2 (score: 2)

# Load package
library(data.table)

# Setup data
df <- data.table(
  id_h=c("A","A","A","A","B","B","B"),
  blk_h=c("A1","A1","A2","A2","B1","B2","B2"),
  blk_w=c("W1","W2","W3","W3","W1","W2","W2"),
  dist=c(4.3,5.6,7.0,8.7,5.2,6.5,6.8),
  Flow=c(3,6,3,7,5,4,2),
  CumFlow=c(3,9,12,19,5,9,11),
  Med_C=c(10,10,10,10,6,6,6))

# Calculation
df.out <- df[CumFlow >= Med_C, .SD[1], by = id_h]

The solution looks like this:

> df.out
   id_h blk_h blk_w dist Flow CumFlow Med_C
1:    A    A2    W3  7.0    3      12    10
2:    B    B2    W2  6.5    4       9     6
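
As a possible variation on the data.table approach, the sketch below collects row numbers with .I and then subsets once, an idiom often suggested as an alternative to .SD[1] on large tables. The names dt, idx and out are made up for illustration, and whether it is actually faster on your data is an assumption you would need to check.

```r
library(data.table)

dt <- data.table(
  id_h    = c("A","A","A","A","B","B","B"),
  blk_h   = c("A1","A1","A2","A2","B1","B2","B2"),
  blk_w   = c("W1","W2","W3","W3","W1","W2","W2"),
  dist    = c(4.3,5.6,7.0,8.7,5.2,6.5,6.8),
  Flow    = c(3,6,3,7,5,4,2),
  CumFlow = c(3,9,12,19,5,9,11),
  Med_C   = c(10,10,10,10,6,6,6)
)

# For each id_h, the row number in dt of the first row with CumFlow >= Med_C
# (NA for a group that never reaches Med_C)
idx <- dt[, .I[which(CumFlow >= Med_C)[1]], by = id_h]$V1

# Drop groups without a match, then subset once
out <- dt[idx[!is.na(idx)]]
```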

Answer 3 (score: 1)

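Two calls to filter can solve this.

Using group_by to work within each id_h, the first filter keeps the rows where CumFlow is greater than or equal to Med_C. The second filter then returns, within each id_h, the row with the lowest CumFlow. This only works because the data is sorted. To make it more robust, you could consider adding a call to arrange after the call to group_by.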

library(dplyr)

df <- data.frame(
  id_h    = c("A","A","A","A","B","B","B"),
  blk_h   = c("A1","A1","A2","A2","B1","B2","B2"),
  blk_w   = c("W1","W2","W3","W3","W1","W2","W2"),
  dist    = c(4.3,5.6,7.0,8.7,5.2,6.5,6.8),
  Flow    = c(3,6,3,7,5,4,2),
  CumFlow = c(3,9,12,19,5,9,11),
  Med_C   = c(10,10,10,10,6,6,6)
)
df

df %>%
  group_by(id_h) %>%
  filter(CumFlow >= Med_C) %>%
  filter(CumFlow == min(CumFlow))
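
A minimal sketch of the more defensive variant, with the sort made explicit. It assumes the intended order is ascending dist within each id_h, as in the example data. The deliberately shuffled input (shuffled) and the result name (res) are made up for illustration. Sorting by id_h and dist before grouping has the same effect as arranging within groups, and slice(1) is used in place of the second filter so that ties in CumFlow cannot return more than one row.

```r
library(dplyr)

df <- data.frame(
  id_h    = c("A","A","A","A","B","B","B"),
  blk_h   = c("A1","A1","A2","A2","B1","B2","B2"),
  blk_w   = c("W1","W2","W3","W3","W1","W2","W2"),
  dist    = c(4.3,5.6,7.0,8.7,5.2,6.5,6.8),
  Flow    = c(3,6,3,7,5,4,2),
  CumFlow = c(3,9,12,19,5,9,11),
  Med_C   = c(10,10,10,10,6,6,6)
)

# Shuffle the rows to mimic input that has lost its ordering
shuffled <- df[c(7, 3, 1, 6, 4, 2, 5), ]

res <- shuffled %>%
  arrange(id_h, dist) %>%        # restore the order explicitly
  group_by(id_h) %>%
  filter(CumFlow >= Med_C) %>%   # rows at or above the median commuter
  slice(1) %>%                   # first such row per neighborhood
  ungroup()
```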