This is a follow-up to the question here.
I have the following data.
Data:
df = structure(list(Org_ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
Market_volume = c(100L, 200L, 300L, 50L, 500L, 400L, 200L,
300L, 100L), Indicator_variable = c(1L, 0L, 0L, 1L, 1L, 0L,
0L, 0L, 0L), variable3 = c(10L, 1L, 1L, 4L, 2L, 3L, 3L, 10L, 3L), variable4 = c(2L, 1L, 1L, 7L, 2L, 3L, 3L, 8L, 3L)), .Names = c("Org_ID", "Market_volume", "Indicator_variable", "Var3", "Var4"
), class = "data.frame", row.names = c(NA, -9L))
Using library(dplyr), I calculated the percentage of NAs by market volume for each Org_ID with the following code:
df %>%
group_by(Org_ID) %>%
summarize(sum_market_vol = sum(Market_volume*!Indicator_variable),
tot_market_vol = sum(Market_volume)) %>%
transmute(Org_ID, Perc_Market_Vol = 100*sum_market_vol/tot_market_vol)
Result:
# A tibble: 3 x 2
Org_ID Perc_Market_Vol
<int> <dbl>
1 1 83.33333
2 2 0.00000
3 3 100.00000
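For reference, the first row of that result can be reproduced by hand. The sketch below uses only the three rows of Org_ID 1 from the data above; the names kept and perc_org1 are introduced here for illustration. Note that `!` turns the 0/1 indicator into TRUE/FALSE, so multiplying by it zeroes out the rows where the indicator is 1.

```r
# Rows for Org_ID 1, copied from the data above
Market_volume <- c(100L, 200L, 300L)
Indicator_variable <- c(1L, 0L, 0L)

# !Indicator_variable is FALSE, TRUE, TRUE; as the right operand of `*`
# it is negated first, then coerced to 0/1 by the multiplication
kept <- Market_volume * !Indicator_variable

# Share of market volume on rows where the indicator is 0
perc_org1 <- 100 * sum(kept) / sum(Market_volume)
print(perc_org1)
```

The order matters: writing `!Indicator_variable * Market_volume` would negate the whole product instead, because `!` binds more loosely than `*` in R.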
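Question: I want to subset the original data by removing all rows of an Org_ID (say, 2) if Perc_Market_Vol < 30. That is, I don't want to remove individual rows within an Org_ID, but the entire Org_ID, i.e. all rows with, say, Org_ID = 1 or Org_ID = 2. How can I connect the two tables, or do the subsetting within the function?
I would like the new data to look like this: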
df1 = structure(list(Org_ID = c(1L, 1L, 1L, 3L, 3L, 3L, 3L),
Market_volume = c(100L, 200L, 300L, 400L, 200L,
300L, 100L), Indicator_variable = c(1L, 0L, 0L, 0L,
0L, 0L, 0L), variable3 = c(10L, 1L, 1L, 3L, 3L, 10L, 3L), variable4 = c(2L, 1L, 1L, 3L, 3L, 8L, 3L)), .Names = c("Org_ID", "Market_volume", "Indicator_variable", "Var3", "Var4"
), class = "data.frame", row.names = c(NA, -7L))
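Answer 0: (score: 0)
You can filter with group_by %>% filter, without materializing the summarized data frame; inside filter you can compute the summary condition for each group: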
df %>%
group_by(Org_ID) %>%
filter(sum(Market_volume * !Indicator_variable)/sum(Market_volume) > 0.3)
# A tibble: 7 x 5
# Groups: Org_ID [2]
# Org_ID Market_volume Indicator_variable Var3 Var4
# <int> <int> <int> <int> <int>
#1 1 100 1 10 2
#2 1 200 0 1 1
#3 1 300 0 1 1
#4 3 400 0 3 3
#5 3 200 0 3 3
#6 3 300 0 10 8
#7 3 100 0 3 3
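If you would rather keep the summary table and connect it back to the original data, as the question asks, one possible sketch is a semi_join on Org_ID. The names perc, keep_ids and df1 are chosen here for illustration, and the threshold is written as >= 30 to match "remove if Perc_Market_Vol < 30" in the question (the filter above uses > 0.3, which gives the same rows for this data).

```r
library(dplyr)

# Same data as in the question, built with data.frame() for readability
df <- data.frame(
  Org_ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
  Market_volume = c(100L, 200L, 300L, 50L, 500L, 400L, 200L, 300L, 100L),
  Indicator_variable = c(1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L),
  Var3 = c(10L, 1L, 1L, 4L, 2L, 3L, 3L, 10L, 3L),
  Var4 = c(2L, 1L, 1L, 7L, 2L, 3L, 3L, 8L, 3L)
)

# Per-organization percentage, as computed in the question
perc <- df %>%
  group_by(Org_ID) %>%
  summarize(
    Perc_Market_Vol = 100 * sum(Market_volume * !Indicator_variable) /
      sum(Market_volume)
  )

# Organizations that meet the 30% threshold
keep_ids <- perc %>% filter(Perc_Market_Vol >= 30)

# semi_join keeps the rows of df that have a match in keep_ids,
# without adding any columns from keep_ids
df1 <- semi_join(df, keep_ids, by = "Org_ID")
print(df1)
```

This gives a plain, ungrouped data frame, whereas the group_by %>% filter result stays grouped unless you add ungroup().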