Find strings in data frame A that contain a substring present in data frame B

Time: 2017-04-25 23:15:43

Tags: r string substring string-comparison

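I have two data frames, A and B. A holds full sentences, and B holds the recurring phrases I am looking for. I want to find every row in A whose string contains, as a whole string or as a substring, one of the strings in data frame B. For example:

Data frame A has: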

    "Sally is great"
     "John is great"
  "Sally likes peas"
 "John likes onions"
  "Jane is in Paris"
"Archie is in Paris"

Data frame B has:

"in Paris"
"is great"

The output would be:

    "Sally is great"
     "John is great"
  "Jane is in Paris"
"Archie is in Paris"

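because these are the rows that contain a string/substring from data frame B.

This is the equivalent of WHERE x LIKE '%substring%' in SQL, but for a set of substrings.

I have close to 2 million rows in A and close to 300,000 rows in B. I considered using str_match in a loop, but given the size of the data that is probably not a viable solution.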

2 Answers:

Answer 0: (score: 1)

One approach is to loop over the elements of the smaller set and use grep to check whether each one occurs in the larger set.

big = c("Sally is great",
        "John is great",
        "Sally likes peas",
        "John likes onions",
        "Jane is in Paris",
        "Archie is in Paris")
small = c("in Paris",
          "is great")

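# fixed = TRUE matches each phrase literally instead of as a regex;
# unique() keeps a row from being returned twice if it contains several phrases
big[unique(unlist(lapply(small, function(a) grep(a, big, fixed = TRUE))))]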
#[1] "Jane is in Paris"   "Archie is in Paris" "Sally is great"     "John is great"     
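
The result above comes back in the order of the phrases, not in the order of the rows in A. If the original row order matters, one possible variation (a sketch, not part of the original answer) is to build a logical mask with grepl and combine the per-phrase results with an element-wise OR. The names `hits` and `matched` are made up for illustration, and `fixed = TRUE` assumes the phrases are literal text rather than regular expressions.

```r
big <- c("Sally is great",
         "John is great",
         "Sally likes peas",
         "John likes onions",
         "Jane is in Paris",
         "Archie is in Paris")
small <- c("in Paris",
           "is great")

# One logical vector per phrase: TRUE where that phrase occurs in big.
# fixed = TRUE treats the phrase as a literal string, not a regex.
per_phrase <- lapply(small, function(p) grepl(p, big, fixed = TRUE))

# Element-wise OR across phrases: TRUE if any phrase occurs in the row.
hits <- Reduce(`|`, per_phrase)

# Subsetting with a logical mask keeps the rows in their original order
# and cannot return the same row twice.
matched <- big[hits]
print(matched)
```

This still makes one pass over A per phrase, so with 300,000 phrases it is a way to get the semantics right on a sample rather than a solution to the scale problem raised in the question.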

Answer 1: (score: 1)

We can use stri_detect from the stringi package:

library(stringi)
big[stri_detect(big, regex = paste(small, collapse="|"))]
#[1] "Sally is great"     "John is great"      "Jane is in Paris"  
#[4] "Archie is in Paris"

Data

big <- c("Sally is great",
    "John is great",
    "Sally likes peas",
    "John likes onions",
    "Jane is in Paris",
    "Archie is in Paris")
small <- c("in Paris",
      "is great")
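
Pasting every phrase into a single alternation treats each phrase as a regular expression, and with hundreds of thousands of phrases the combined pattern becomes very large. The following is a rough sketch of one way to work around both points, under the assumption that the phrases are meant literally: escape the regex metacharacters, then test the phrases in batches and skip rows that have already matched. `escape_regex`, `match_any`, and `chunk_size` are hypothetical names introduced here, and base R's grepl is used so the example runs without extra packages; stringi's stri_detect_regex could presumably be swapped in for the grepl call.

```r
# Hypothetical helper: backslash-escape regex metacharacters so that a
# phrase is matched as literal text inside a larger pattern.
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

# Hypothetical helper: TRUE for each element of `text` that contains at
# least one of `phrases`. Phrases are tested in batches of `chunk_size`.
match_any <- function(text, phrases, chunk_size = 1000L) {
  hit <- logical(length(text))
  chunks <- split(phrases, ceiling(seq_along(phrases) / chunk_size))
  for (chunk in chunks) {
    todo <- which(!hit)              # rows that have not matched yet
    if (length(todo) == 0L) break    # every row matched; stop early
    pattern <- paste(escape_regex(chunk), collapse = "|")
    hit[todo] <- grepl(pattern, text[todo], perl = TRUE)
  }
  hit
}

big <- c("Sally is great", "John is great", "Sally likes peas",
         "John likes onions", "Jane is in Paris", "Archie is in Paris")
small <- c("in Paris", "is great")

# chunk_size = 1L forces several batches on this tiny example.
result <- big[match_any(big, small, chunk_size = 1L)]
print(result)
```

The batch size is a guess to be tuned on real data; whether this is fast enough for 2 million rows against 300,000 phrases has not been measured here.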