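I am new to R programming and am trying to delete certain rows within each group once a filter condition has been met.
Scenario: for each GROUP, if two rows of TYPE "B" occur consecutively, delete all subsequent rows of that GROUP. The "Include in DataSet?" column shows which rows should appear in the output.
Here is my sample input: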
GROUP TYPE Include in DataSet?
--------------------------------------------
1 A yes
1 A yes
1 B yes
1 B yes
1 B no
2 A yes
2 B yes
2 B yes
2 A no
2 B no
2 B no
DF = structure(list(GROUP = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L), TYPE = c("A", "A", "B", "B", "B", "A", "B", "B", "A",
"B", "B"), inc = c("yes", "yes", "yes", "yes", "no", "yes", "yes",
"yes", "no", "no", "no")), .Names = c("GROUP", "TYPE", "inc"), row.names = c(NA,
-11L), class = "data.frame")
Expected output:
GROUP TYPE Include in DataSet?
--------------------------------------------
1 A yes
1 A yes
1 B yes
1 B yes
2 A yes
2 B yes
2 B yes
I tried writing some code, but had no luck because of the grouping.
i=1
j=2
x <- allrows
for (i in x){
for(j in x){
if(i==j){
a$REMOVE=1
}
else{
a$REMOVE=2
}
}
}
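Answer 0 (score: 8)
You can do this by creating a new variable that identifies "double B" rows, and then filtering out the rows that come after the first "double B" row in each group: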
library(dplyr)
df %>%
group_by(GROUP) %>%
# Create new variable that tests whether each row and the one above it both have TYPE == 'B'
mutate(double_B = (TYPE == 'B' & lag(TYPE) == 'B')) %>%
# Find the first row with `double_B` in each group, filter out rows after it
filter(row_number() <= min(which(double_B == TRUE))) %>%
# Optionally, remove `double_B` column when done with it
select(-double_B)
# A tibble: 7 x 3
# Groups: GROUP [2]
GROUP TYPE IncludeinDataSet
<int> <chr> <chr>
1 1 A yes
2 1 A yes
3 1 B yes
4 1 B yes
5 2 A yes
6 2 B yes
7 2 B yes
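To see what the pipeline computes inside one group, here is a minimal base R sketch for the TYPE values of GROUP 2. The names type, prev, first_hit and keep are illustrative only, and prev is built by hand on the assumption that it mirrors what lag(TYPE) returns for that group.

```r
# TYPE values of GROUP 2 from the sample data
type <- c("A", "B", "B", "A", "B", "B")

# Previous value of each element (NA for the first row), standing in for lag(TYPE)
prev <- c(NA, head(type, -1))

# TRUE where a row and the row above it are both "B"
double_B <- type == "B" & prev == "B"

# Position of the first "double B" row; rows after it are dropped
first_hit <- min(which(double_B))
keep <- seq_along(type) <= first_hit

type[keep]
```

Rows 3 and 6 are flagged as "double B", the cut falls at row 3, and only the first three rows of the group survive.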
As @Frank points out in the comments, you do not need to create the double_B variable: you can test the "double B" condition directly in the which call inside filter:
df %>%
group_by(GROUP) %>%
# Find the first row with `double_B` in each group, filter out rows after it
filter(row_number() <= min(which(TYPE == 'B' & lag(TYPE) == 'B')))
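Also, if no "double B" is found in a group, min(which(...)) is called on an empty vector, so it returns Inf with a warning; the filter still works correctly and keeps every row of that group.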
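If the warning is unwanted, one option is to handle the empty case explicitly. The sketch below is an assumption rather than part of the answer: first_double_b is a hypothetical helper name, and it rebuilds the lagged vector by hand so that it runs without dplyr.

```r
# Hypothetical helper (not part of dplyr): index of the first row that completes
# a "B", "B" pair, or the group length when the group has no such pair.
first_double_b <- function(type) {
  prev <- c(NA, head(type, -1))             # previous value, like lag(type)
  hits <- which(type == "B" & prev == "B")  # which() ignores NA and FALSE
  if (length(hits) == 0) length(type) else hits[1]
}

first_double_b(c("A", "B", "B", "A"))  # 3
first_double_b(c("A", "B", "A", "A"))  # 4: no pair, so every row is kept
```

Inside the grouped pipeline this could be used as filter(row_number() <= first_double_b(TYPE)), which should give the same rows without calling min() on an empty vector.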
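Answer 1 (score: 3)
This can be done by comparing the current value of "TYPE" with the next value of "TYPE" to find a numeric index, then using seq_len to get the sequence from 1 to that index (plus one) and subsetting the rows with it inside slice: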
library(dplyr)
df1 %>%
group_by(GROUP) %>%
slice(seq_len(which((TYPE == "B") & lead(TYPE) == "B")[1] + 1))
# A tibble: 7 x 3
# Groups: GROUP [2]
# GROUP TYPE IncludeInDataSet
# <int> <chr> <chr>
#1 1 A yes
#2 1 A yes
#3 1 B yes
#4 1 B yes
#5 2 A yes
#6 2 B yes
#7 2 B yes
df1 <- structure(list(GROUP = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L), TYPE = c("A", "A", "B", "B", "B", "A", "B", "B", "A",
"B", "B"), IncludeInDataSet = c("yes", "yes", "yes", "yes", "no",
"yes", "yes", "yes", "no", "no", "no")), class = "data.frame",
row.names = c(NA, -11L))
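The "+ 1" in the slice call is there because lead compares each row with the one below it, so the match lands on the first "B" of the pair rather than the second. A small base R sketch for GROUP 2 follows; nxt, first_of_pair and kept are illustrative names, and nxt is built by hand on the assumption that it mirrors lead(TYPE).

```r
# TYPE values of GROUP 2 from the sample data
type <- c("A", "B", "B", "A", "B", "B")

# Next value of each element (NA for the last row), standing in for lead(TYPE)
nxt <- c(tail(type, -1), NA)

# Index of the first "B" that is immediately followed by another "B"
first_of_pair <- which(type == "B" & nxt == "B")[1]

# Keep rows 1 through the second "B" of that pair
kept <- type[seq_len(first_of_pair + 1)]
kept
```

Here first_of_pair is 2, so rows 1 to 3 are kept. Note that when a group has no pair, which(...)[1] is NA, so a real pipeline may need a guard before calling seq_len.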
Answer 2 (score: 1)
Another approach could be:
library(dplyr)
library(data.table)
df %>%
group_by(GROUP, rleid(TYPE)) %>%
mutate(temp = seq_along(TYPE)) %>%
ungroup() %>%
group_by(GROUP) %>%
filter(row_number() <= min(which(TYPE == "B" & temp == 2))) %>%
select(GROUP, TYPE, IncludeInDataSet)
Answer 3 (score: 0)
Here is a base R solution:
subset(DF, as.logical(ave(DF$TYPE, DF$GROUP, FUN = function(x)
seq_along(x) <= which((sequence(rle(x == "B")$lengths) * (x == "B")) %in% 2)[1])))
# GROUP TYPE inc
# 1 1 A yes
# 2 1 A yes
# 3 1 B yes
# 4 1 B yes
# 6 2 A yes
# 7 2 B yes
# 8 2 B yes
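The one-liner above can be unpacked into named steps. This is a sketch of the same run-length idea, not the answerer's exact code: keep_through_double_b is a hypothetical helper name, the grouping uses split and unsplit instead of ave, and groups without a "double B" are kept whole (an added assumption).

```r
# Hypothetical helper: TRUE for every row up to and including the second "B"
# of the first run of consecutive "B" values; all TRUE if there is no such run.
keep_through_double_b <- function(type) {
  is_b <- type == "B"
  pos_in_run <- sequence(rle(is_b)$lengths)    # position of each row within its run
  second_b <- which(is_b & pos_in_run == 2)[1] # first row that is the 2nd "B" in a run
  if (is.na(second_b)) return(rep(TRUE, length(type)))
  seq_along(type) <= second_b
}

DF <- data.frame(
  GROUP = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L),
  TYPE  = c("A", "A", "B", "B", "B", "A", "B", "B", "A", "B", "B")
)

# Apply the helper per GROUP and put the flags back in the original row order
keep <- unsplit(lapply(split(DF$TYPE, DF$GROUP), keep_through_double_b), DF$GROUP)
result <- DF[keep, ]
result
```

For the sample data this keeps rows 1, 2, 3, 4, 6, 7 and 8, matching the output shown above.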