Finding partial matches in a data frame based on a vector

Time: 2016-06-01 11:05:25

Tags: regex r dataframe data.table sql-like

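I have a data frame a and a vector b (derived from another data frame). I want to find every row of a whose code occurs in b. Unfortunately, the values in b are sometimes missing a leading character.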

a <- structure(list(GSN_IDENTITY_CODE = c("01234567", "65461341", "NH1497", "ZH0080", "TP5146", "TP5146"), PIG_ID = c("129287133", "120561144", "119265685", "121883198", "109371743", "109371743" ), SEX_CODE = c("Z", "Z", "Z", "Z", "B", "B")), .Names = c("GSN_IDENTITY_CODE", "PIG_ID", "SEX_CODE"), row.names = c(NA, 6L), class = "data.frame")

> a
#  GSN_IDENTITY_CODE    PIG_ID SEX_CODE
#1          01234567 129287133        Z
#2          65461341 120561144        Z
#3            NH1497 119265685        Z
#4            ZH0080 121883198        Z
#5            TP5146 109371743        B
#6            TP5146 109371743        B

b <- c("65461341", "1234567", "ZH0080", "TP5146")

My expected output is:

a
#  GSN_IDENTITY_CODE    PIG_ID SEX_CODE
#1          01234567 129287133        Z
#2          65461341 120561144        Z
#4            ZH0080 121883198        Z
#5            TP5146 109371743        B

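Removing the duplicates first solves one problem, but I still need a way to select every row that contains a value from vector b, which means picking up more rows than an exact match returns: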

a <- a[!duplicated(a$GSN_IDENTITY_CODE),]

Unfortunately I can't use %in%: it returns the duplicates, and it misses the first row because it doesn't accept regular expressions:

> a[a$GSN_IDENTITY_CODE %in% b,]
#  GSN_IDENTITY_CODE    PIG_ID SEX_CODE
#2          65461341 120561144        Z
#4            ZH0080 121883198        Z
#5            TP5146 109371743        B
#6            TP5146 109371743        B

Using data.table's %like% only works for the first string in vector b:

library(data.table)
> setDT(a)
> a[a$GSN_IDENTITY_CODE %like% b,]
#   GSN_IDENTITY_CODE    PIG_ID SEX_CODE
#1:          65461341 120561144        Z
Warning message:
In grepl(pattern, vector) :
  argument 'pattern' has length > 1 and only the first element will be used

Is there a function in R that supports what I need?

@Frank's suggestion produces the following errors:

a <- structure(list(GSN_IDENTITY_CODE = c("01234567", "65461341", "NH1497", "ZH0080", "TP5146", "TP5146"), PIG_ID = c("129287133", "120561144", "119265685", "121883198", "109371743", "109371743" ), SEX_CODE = c("Z", "Z", "Z", "Z", "B", "B")), .Names = c("GSN_IDENTITY_CODE", "PIG_ID", "SEX_CODE"), row.names = c(NA, 6L), class = "data.frame")

b <- c("65461341", "1234567", "ZH0080", "TP5146")

> a[.(b), on="GSN_IDENTITY_CODE", nomatch=FALSE, mult="first"]
Error in `[.data.frame`(a, .(b), on = "GSN_IDENTITY_CODE", nomatch = FALSE,  : 
  unused arguments (on = "GSN_IDENTITY_CODE", nomatch = FALSE, mult = "first")
> setDT(a)
> a[.(b), on="GSN_IDENTITY_CODE", nomatch=FALSE, mult="first"]
Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
  x.'GSN_IDENTITY_CODE' is a character column being joined to i.'NA' which is type 'NULL'. Character columns must join to factor or character columns.

1 answer:

Answer 0 (score: 1)

If the extra character can appear anywhere in the string, you could do something like this:

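library(stringdist)
library(purrr)

# Levenshtein distance from each code in a to its nearest value in b
a$closest_match <- map(a$GSN_IDENTITY_CODE, ~stringdist(., b, method = "lv")) %>% 
  map_dbl(min)
# keep rows that are at most one edit away from some value in b
a[a$closest_match < 2, ]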
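
The same distance-based idea can be sketched without extra packages. This is only an illustration, not part of the original answer: it swaps stringdist for base R's adist (from utils), which returns a matrix of generalized Levenshtein distances. The names d, closest, and near are made up for the example, and the data is rebuilt from the question so the block runs on its own.

```r
a <- data.frame(
  GSN_IDENTITY_CODE = c("01234567", "65461341", "NH1497", "ZH0080", "TP5146", "TP5146"),
  PIG_ID = c("129287133", "120561144", "119265685", "121883198", "109371743", "109371743"),
  SEX_CODE = c("Z", "Z", "Z", "Z", "B", "B"),
  stringsAsFactors = FALSE
)
b <- c("65461341", "1234567", "ZH0080", "TP5146")

# distance matrix: one row per code in a, one column per value in b
d <- adist(a$GSN_IDENTITY_CODE, b)

# smallest distance from each code to any value in b
closest <- apply(d, 1, min)

# keep rows that are at most one edit away from some value in b
near <- a[closest < 2, ]
near
```

Note that a threshold of one edit also accepts single-character substitutions, not just a missing character, so it is looser than the problem strictly requires. The duplicated TP5146 row is still present and would need to be dropped separately with duplicated().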

If the extra character is always at the beginning, I would do something like this:

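library(stringr)

# strip a single leading digit from each code
a$stripped_code <- str_replace(a$GSN_IDENTITY_CODE,"^\\d", "")

# keep a row if either the full code or the stripped code is in b
a$keep <- a$GSN_IDENTITY_CODE %in% b | a$stripped_code %in% b
a[a$keep, ]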
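
Neither snippet above removes the duplicated TP5146 row that the question wants dropped. A minimal base R sketch is to collapse b into a single alternation, which avoids the %like% warning about only the first pattern being used, and to deduplicate afterwards. It assumes the missing character is always a single leading one and that the values in b contain no regex metacharacters. The names pattern, hits, and result are illustrative.

```r
a <- data.frame(
  GSN_IDENTITY_CODE = c("01234567", "65461341", "NH1497", "ZH0080", "TP5146", "TP5146"),
  PIG_ID = c("129287133", "120561144", "119265685", "121883198", "109371743", "109371743"),
  SEX_CODE = c("Z", "Z", "Z", "Z", "B", "B"),
  stringsAsFactors = FALSE
)
b <- c("65461341", "1234567", "ZH0080", "TP5146")

# one regex: an optional single leading character, then any value of b,
# anchored at both ends so longer codes do not match by accident
pattern <- paste0("^.?(", paste(b, collapse = "|"), ")$")

# rows whose code matches some value of b, with or without the extra character
hits <- a[grepl(pattern, a$GSN_IDENTITY_CODE), ]

# drop repeated codes, keeping the first occurrence
result <- hits[!duplicated(hits$GSN_IDENTITY_CODE), ]
result
```

With the sample data this keeps rows 1, 2, 4 and 5, which is the expected output from the question. If b could contain characters such as "." or "(", they would need escaping before being pasted into the pattern.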