I have a data frame with 3 columns and a few hundred rows. The type column contains one of three strings: "Open", "Close" or "Cancel".
type unique_id group
1 Open 11468329881 g_2
2 Close 11468329881 g_2
3 Open 23254429881 g_3
4 Cancel 23254429881 g_3
5 Open 32550829881 g_4
6 Close 32550829881 g_4
7 Open 43254429881 g_5
8 Close 43254429881 g_5
9 Open 52627629881 g_6
10 Close 52627629881 g_6
11 Open 62747029881 g_7
12 Close 62747029881 g_7
13 Open 2499619881 g_8
14 Close 2499619881 g_8
15 Open 32975019881 g_9
16 Close 32975019881 g_9
17 Open 42975119881 g_10
18 Cancel 42975119881 g_10
19 Open 53560019881 g_11
20 Open 53560019881 g_11
21 Open 62521619881 g_12
22 Close 62521619881 g_12
23 Open 72663719881 g_13
24 Close 72663719881 g_13
25 Open 82663819881 g_14
26 Close 82663819881 g_14
27 Open 92747019881 g_15
28 Open 92747019881 g_15
29 Open 1499629881 g_15
30 Close 1499629881 g_15
I want to loop over each group (e.g. g_1, g_2) and subset its rows if the order is "Open", "Close" or "Open", "Cancel". Any other order should be ignored.
For example, g_2 should be subset to
type unique_id group
1 Open 11468329881 g_2
2 Close 11468329881 g_2
and g_11 should be ignored, because its order is "Open", "Open". g_15 should be subset to
type unique_id group
29 Open 1499629881 g_15
30 Close 1499629881 g_15
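Any help would be greatly appreciated.
Edit: I apologize if I was unclear earlier. For the sample given below, the suggested solution does not work for g_8: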
Open 21921312463 g_1
Close 21921312463 g_1
Open 31032312463 g_2
Close 31032312463 g_2
Open 41032212463 g_3
Close 41032212463 g_3
Open 51032312463 g_4
Close 51032312463 g_4
Open 61032212463 g_5
Close 61032212463 g_5
Open 71032312463 g_6
Close 71032312463 g_6
Open 81032212463 g_7
Close 81032212463 g_7
Open 21921312463 g_8
Open 21921312463 g_8
Close 21921312463 g_8
Open 31032312463 g_9
Close 31032312463 g_9
Open 41032212463 g_10
Close 41032212463 g_10
Open 51032312463 g_11
Close 51032312463 g_11
Open 61032212463 g_12
Close 61032212463 g_12
Open 71032312463 g_13
Close 71032312463 g_13
Open 81032212463 g_14
Close 81032212463 g_14
I would like g_8 to be filtered to give
Open 21921312463 g_8
Close 21921312463 g_8
so that the first row in the group is ignored.
Answer 0 (score: 4)
After grouping by 'group', filter the rows by checking whether all the elements of the vector c("Open", "Close") or (|) of c("Open", "Cancel") are present (%in%) in the 'type' column.
library(dplyr)
df1 %>%
  group_by(group) %>%
  #group_by(group, unique_id) %>%
  # keep a group only if it contains both "Open" and "Close", or both "Open" and "Cancel"
  filter(all(c("Open", "Close") %in% type) | all(c("Open", "Cancel") %in% type))
If the grouping variables should also include 'unique_id', replace the group_by line with group_by(group, unique_id).
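To see what the condition inside filter evaluates to for a single group, here is a minimal base R sketch. The helper name has_pair and the sample vectors are made up for illustration and are not part of the original code. Note that %in% only tests for presence, so this check does not look at the order of the rows.

```r
# Hypothetical helper: the same condition used inside filter(),
# applied to the 'type' values of one group.
has_pair <- function(type) {
  all(c("Open", "Close") %in% type) | all(c("Open", "Cancel") %in% type)
}

has_pair(c("Open", "Close"))   # TRUE
has_pair(c("Open", "Cancel"))  # TRUE
has_pair(c("Open", "Open"))    # FALSE: neither "Close" nor "Cancel" is present
has_pair(c("Close", "Open"))   # TRUE: presence is tested, not order
```

Because only presence is tested, a group such as "Open", "Open", "Close" passes the check with all three rows kept.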
Based on the updated dataset and the new logic, we check whether the value following an "Open" is "Close" or "Cancel".
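df2 %>%
  group_by(group, unique_id) %>%
  # position of the first "Open" that is immediately followed by "Close" or "Cancel" (NA if none)
  mutate(ind = which(type == "Open" & lead(type) %in% c("Close", "Cancel"))[1]) %>%
  # drop groups that have no such pair
  filter(!is.na(ind)) %>%
  # keep only that "Open" row and the row right after it
  slice(ind[1]:(ind[1] + 1)) %>%
  select(-ind)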
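As a rough check of this second approach, the sketch below runs it on a few rows modelled on the sample in the question. The data frame is typed in by hand, g_99 with unique_id "99" is a made-up group added to show a case with no matching pair, and unique_id is stored as character purely to keep the example simple. The final ungroup() is an addition so the result prints as a plain table.

```r
library(dplyr)

# Small hand-typed sample; g_99 is hypothetical
df2 <- data.frame(
  type      = c("Open", "Close", "Open", "Open", "Close", "Open", "Open"),
  unique_id = c("81032212463", "81032212463",
                "21921312463", "21921312463", "21921312463",
                "99", "99"),
  group     = c("g_7", "g_7", "g_8", "g_8", "g_8", "g_99", "g_99"),
  stringsAsFactors = FALSE
)

res <- df2 %>%
  group_by(group, unique_id) %>%
  # index of the first "Open" directly followed by "Close" or "Cancel"
  mutate(ind = which(type == "Open" & lead(type) %in% c("Close", "Cancel"))[1]) %>%
  filter(!is.na(ind)) %>%
  slice(ind[1]:(ind[1] + 1)) %>%
  select(-ind) %>%
  ungroup()

res
```

With this input, g_7 keeps both of its rows, g_8 keeps only its second "Open" and the "Close" after it, and g_99 is dropped entirely.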