This one is tricky. Suppose I have a first dataset, df:
sample id name
1 ID200,ID300,ID299 first
2 ID2,ID123 second
3 ID90 third
And a second dataset, df_1:
ids condition
ID200 y
ID300 n
ID299 n
ID2 y
ID123 y
ID90 n
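I need to keep only the rows of the first dataset in which every ID in the id column meets a given condition in the second table, for example condition == "y".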
So for this example, the filtered result should be:
sample id name
2 ID2,ID123 second
I was thinking of using something like:
new_df = df %>%
filter(grepl('ID', id), df_1$condition == 'y')
But clearly I need something else. Can you give me some pointers?
Edit: as I said in the comments, what happens if the id column of df also contains other text?
sample id name
1 ID = ID200,ID300,ID299,abcd first
2 ID = ID2,ID123, dfg second
3 ID = ID90, text third
Answer 0 (score: 1)
Perhaps a little inelegant, but this gives you the final status of each sample.
library(tidyverse)
df <- tibble(sample = c(1, 2, 3),
id = c("ID200,ID300,ID299", "ID2,ID123", "ID90"),
name = c("first", "second", "third"))
df_1 <- tibble(ids = c("ID200", "ID300", "ID299", "ID2", "ID123", "ID90"),
condition = c("y", "n", "n", "y", "y", "n"))
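df2 <- df %>%
# split each comma-separated string into a list of IDs, then one row per ID
mutate(ids = str_split(id, ",")) %>%
unnest(ids) %>%
inner_join(df_1, by = "ids") %>%
group_by(sample) %>%
# "n" sorts before "y", so min() is "y" only when every ID is "y"
summarise(condition = min(condition))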
You can then join this back to the original data frame and filter.
filtered <- inner_join(df, df2, by = "sample") %>%
filter(condition == "y")
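The min() step above depends on "n" sorting before "y". As a sketch of a more explicit variant, the same check can be written with all(). The names all_yes, all_y and filtered_alt are made up for this illustration, and semi_join() is used so that no extra column is added to df. IDs that are missing from df_1 are treated as not meeting the condition, which is an assumption about the desired behaviour.

```r
library(dplyr)
library(tidyr)

df <- tibble(sample = c(1, 2, 3),
             id = c("ID200,ID300,ID299", "ID2,ID123", "ID90"),
             name = c("first", "second", "third"))
df_1 <- tibble(ids = c("ID200", "ID300", "ID299", "ID2", "ID123", "ID90"),
               condition = c("y", "n", "n", "y", "y", "n"))

# One row per (sample, ID), then a single logical flag per sample.
all_yes <- df %>%
  separate_rows(id, sep = ",") %>%
  left_join(df_1, by = c("id" = "ids")) %>%
  group_by(sample) %>%
  # %in% returns FALSE for NA, so an ID absent from df_1 fails the check
  summarise(all_y = all(condition %in% "y"))

# Keep the original rows (comma-separated ids intact) for passing samples.
filtered_alt <- df %>%
  semi_join(filter(all_yes, all_y), by = "sample")
```

Here filtered_alt keeps the same columns as df, so nothing needs to be dropped afterwards.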
Answer 1 (score: 1)
I would start by tidying df so that id contains one observation per row:
library(tidyr)
library(dplyr)
df %>%
separate_rows(id)
sample id name
1 1 ID200 first
2 1 ID300 first
3 1 ID299 first
4 2 ID2 second
5 2 ID123 second
6 3 ID90 third
The same operation, followed by a join with df_1:
df %>%
separate_rows(id) %>%
left_join(df_1, by = c("id" = "ids"))
sample id name condition
1 1 ID200 first y
2 1 ID300 first n
3 1 ID299 first n
4 2 ID2 second y
5 2 ID123 second y
6 3 ID90 third n
Now you can group by sample and filter for the groups whose only condition is "y":
new_df <- df %>%
separate_rows(id) %>%
left_join(df_1, by = c("id" = "ids")) %>%
group_by(sample) %>%
filter(condition == "y",
n_distinct(condition) == 1) %>%
ungroup()
Result:
sample id name condition
<int> <chr> <chr> <chr>
1 2 ID2 second y
2 2 ID123 second y
If you really want to convert back to the original format, with comma-separated IDs in a single column:
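library(purrr)
new_df %>%
# tidyr >= 1.0 syntax; older versions used nest(id)
nest(data = c(id)) %>%
# collapse each nested group of IDs back into one comma-separated string
mutate(newid = map_chr(data, ~paste(.$id, collapse = ","))) %>%
select(sample, id = newid, name)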
sample id name
<int> <chr> <chr>
1 2 ID2,ID123 second
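For the edited question, where id also contains other text, splitting on separators would produce tokens such as abcd that have no match in df_1. One possible approach, sketched here under the assumption that every real ID is the literal "ID" followed by digits, is to extract only those tokens with stringr::str_extract_all(). The names df_text, token, keep and new_df_text are hypothetical and chosen for this example.

```r
library(dplyr)
library(tidyr)
library(stringr)

# The edited version of df from the question (extra text in the id column).
df_text <- tibble(sample = c(1, 2, 3),
                  id = c("ID = ID200,ID300,ID299,abcd",
                         "ID = ID2,ID123, dfg",
                         "ID = ID90, text"),
                  name = c("first", "second", "third"))
df_1 <- tibble(ids = c("ID200", "ID300", "ID299", "ID2", "ID123", "ID90"),
               condition = c("y", "n", "n", "y", "y", "n"))

keep <- df_text %>%
  # Assumption: a real ID is "ID" plus one or more digits; the leading
  # "ID = " label and words like "abcd" do not match this pattern.
  mutate(token = str_extract_all(id, "ID\\d+")) %>%
  unnest(token) %>%
  left_join(df_1, by = c("token" = "ids")) %>%
  group_by(sample) %>%
  # an ID missing from df_1 has an NA condition and counts as failing
  summarise(all_y = all(condition %in% "y")) %>%
  filter(all_y)

# Original rows, untouched, for the samples where every ID is "y".
new_df_text <- semi_join(df_text, keep, by = "sample")
```

Rows in which no token matches the pattern are dropped by unnest() and therefore never reach the result; if such rows should be kept, the pattern or that step would need adjusting.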