Question

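I am trying to filter the rows of one data frame based on columns in another data frame. Basically, I want to extract the rows that have the same ID and whose position lies between start and end. An extra twist is that the IDs are formatted differently in the two data frames.
Finally, the script handles a large amount of data, so anything that saves memory or improves speed is welcome.
I would be glad to get some hints.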

library(dplyr)

df1 <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3), 
                  pos = c(30, 40, 50, 35, 45, 55, 60, 63, 39))

df2 <- data.frame(idstr = c("id1", "id1", "id3", "id4", "id4"), 
                  start=c(30, 20, 30, 40, 20 ),
                  end = c(40, 30, 50, 60, 45))

df.base <- df1[ paste0("id", df1$id) == df2$idstr && 
                 df1$pos >= df2$start &&
                 df1$pos <= df2$end,]

df.dplyr <- df1 %>%
            left_join(df2, by  = c('id' == 'idstr') ) %>%
            filter(pos >= start & pos <= end) %>%
            select(id, pos)

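Edit: the expected output is the rows of df1 that satisfy the condition (their pos falls inside a df2 range with the same id). So, if I have made no mistake:

id, pos
1, 30
1, 40
3, 39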

Explanation: for example, df1[3, ] has id == 1 and pos == 50. Looking at df2, there is no row with df2$idstr == "id1", df2$start <= 50 and df2$end >= 50, so df1[3, ] is filtered out.

Answer 1

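We can use a non-equi join in data.table. Create a matching 'id' column in both datasets, then join with on, combining equality on 'id' with the non-equi conditions between 'pos' and the 'start' and 'end' columns.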
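The code for this answer did not survive, so the following is only a minimal sketch of such a non-equi join, not necessarily the answerer's original. It assumes every idstr is the literal prefix "id" followed by digits. The helper column rn and the names dt1, dt2, hits and res are introduced here for illustration.

```r
library(data.table)

df1 <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                  pos = c(30, 40, 50, 35, 45, 55, 60, 63, 39))
df2 <- data.frame(idstr = c("id1", "id1", "id3", "id4", "id4"),
                  start = c(30, 20, 30, 40, 20),
                  end = c(40, 30, 50, 60, 45))

dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)

# Assumption: idstr is always "id" followed by the numeric id.
dt2[, id := as.numeric(sub("^id", "", idstr))]

# Hypothetical helper column: remember the original row number of df1.
dt1[, rn := .I]

# Non-equi join: same id, and pos inside [start, end].
# x.pos returns the original pos from dt1 rather than the range bound.
hits <- dt1[dt2, on = .(id, pos >= start, pos <= end), nomatch = NULL,
            .(rn = x.rn, id = x.id, pos = x.pos)]

# A df1 row that falls in several ranges appears several times; keep it once.
res <- unique(hits)[order(rn), .(id, pos)]
print(res)
```

The join avoids building every pair of rows that share an id before filtering, which should help with the memory concern raised in the question.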

Answer 2

I converted the idstr column of df2 to numeric by extracting its digits inside mutate. Then I used left_join and filter to get the result.

library(dplyr)


df1 <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3), pos = c(30, 40, 50, 35, 45, 55, 60, 63, 39))

df2 <- data.frame(idstr = c("id1", "id1", "id3", "id4", "id4"), 
                  start=c(30, 20, 30, 40, 20 ),
                  end = c(40, 30, 50, 60, 45))


df2 %>% 
  # '[0-9]+' captures the whole number, so an id such as "id12" is not cut down to 1
  mutate(idstr = as.numeric(stringr::str_extract(idstr, '[0-9]+'))) %>% 
  left_join(df1, by = c('idstr' = 'id')) %>% 
  dplyr::filter(pos >= start & pos <= end)
#>   idstr start end pos
#> 1     1    30  40  30
#> 2     1    30  40  40
#> 3     1    20  30  30
#> 4     3    30  50  39

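One row with df1$id == 1 (pos == 30) fits into two of the ranges in df2, so three rows come back for id == 1. If one of the bounds is made exclusive (as in the code below), the result matches what you want.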


df2 %>% 
  mutate(idstr = as.numeric(stringr::str_extract(idstr, '[0-9]+'))) %>% 
  left_join(df1, by = c('idstr' = 'id')) %>% 
  dplyr::filter(pos > start & pos <= end)

#>   idstr start end pos
#> 1     1    30  40  40
#> 2     1    20  30  30
#> 3     3    30  50  39
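Making a bound exclusive changes which positions match. A possible alternative, sketched here as an assumption rather than as part of the original answer, keeps both bounds inclusive and removes the duplicates with distinct(). It builds the key in df2's format on the df1 side and returns only id and pos, as the question asks. The name res is illustrative.

```r
library(dplyr)

df1 <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                  pos = c(30, 40, 50, 35, 45, 55, 60, 63, 39))
df2 <- data.frame(idstr = c("id1", "id1", "id3", "id4", "id4"),
                  start = c(30, 20, 30, 40, 20),
                  end = c(40, 30, 50, 60, 45),
                  stringsAsFactors = FALSE)

res <- df1 %>%
  # Assumption: pasting "id" in front of id reproduces df2's idstr format.
  mutate(idstr = paste0("id", id)) %>%
  # Pair every df1 row with every df2 range that has the same id.
  inner_join(df2, by = "idstr") %>%
  # Keep pairs where pos lies inside the range, both bounds inclusive.
  filter(pos >= start, pos <= end) %>%
  # A position matching several ranges is reported once.
  distinct(id, pos)

print(res)
```

Recent dplyr versions may warn about a many-to-many relationship in this join, which is expected for this data. Note that distinct(id, pos) would also collapse rows that are genuinely duplicated in df1, and that the intermediate join can grow large when many rows share an id.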

Filtering a data frame on several columns using base or dplyr