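I have two data frames, A and B. A contains complete sentences, and B contains the recurring phrases I am looking for. I want to find all rows in A that contain a string/substring present in data frame B. For example,
Dataframe A has: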
"Sally is great"
"John is great"
"Sally likes peas"
"John likes onions"
"Jane is in Paris"
"Archie is in Paris"
Dataframe B has:
"in Paris"
"is great"
The output would be:
"Sally is great"
"John is great"
"Jane is in Paris"
"Archie is in Paris"
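because these are the rows that contain a string/substring present in data frame B.
This is the equivalent of WHERE x LIKE '%substring%' in SQL, but for a set of substrings.
I have close to 2 million rows in A and close to 300,000 rows in B. I considered using str_match in a loop, but given the size of the data it is probably not a feasible solution.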
Answer 0: (score: 1)
One approach is to iterate over the elements of the smaller set and use grep to check whether each one occurs in the larger set.
big = c("Sally is great",
"John is great",
"Sally likes peas",
"John likes onions",
"Jane is in Paris",
"Archie is in Paris")
small = c("in Paris",
"is great")
big[unlist(lapply(small, function(a) grep(a, big)))]
#[1] "Jane is in Paris" "Archie is in Paris" "Sally is great" "John is great"
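The grep approach above returns rows grouped by phrase, so a row that matches several phrases would appear more than once, and each phrase is interpreted as a regular expression. The following is a minimal sketch of a variant, not the answerer's code, that builds a logical mask instead: it keeps the original row order, returns each row at most once, and uses fixed = TRUE so phrases are matched literally. It assumes the phrases are meant as plain substrings; the names keep, todo and hits are only illustrative.

```r
big <- c("Sally is great", "John is great", "Sally likes peas",
         "John likes onions", "Jane is in Paris", "Archie is in Paris")
small <- c("in Paris", "is great")

# One flag per sentence; TRUE once any phrase has matched it.
keep <- logical(length(big))

for (phrase in small) {
  # Only look at sentences that have not matched yet.
  todo <- which(!keep)
  if (length(todo) == 0L) break
  # fixed = TRUE: treat the phrase as a literal string, not a regex.
  hits <- grepl(phrase, big[todo], fixed = TRUE)
  keep[todo[hits]] <- TRUE
}

big[keep]
#[1] "Sally is great"     "John is great"      "Jane is in Paris"
#[4] "Archie is in Paris"
```

Skipping rows that have already matched shrinks the vector searched on each pass, which may help when many rows match early. It is still one pass per phrase, so it is worth timing on a sample before running all 300,000 phrases.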
Answer 1: (score: 1)
We can use stri_detect from the stringi package:
library(stringi)
big[stri_detect(big, regex = paste(small, collapse="|"))]
#[1] "Sally is great" "John is great" "Jane is in Paris"
#[4] "Archie is in Paris"
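With around 300,000 phrases, a single pattern joined with "|" may become very large, and any regex metacharacters in the phrases (such as "." or "(") would change what is matched. The following is a rough sketch of one way to handle both, under stated assumptions. It swaps in base R's grepl with perl = TRUE rather than stri_detect so that no package is needed. It wraps each phrase in \Q...\E so it is taken literally, which assumes no phrase itself contains "\E". It processes the phrases in chunks. chunk_size is a hypothetical tuning value, not something from the question.

```r
big <- c("Sally is great", "John is great", "Sally likes peas",
         "John likes onions", "Jane is in Paris", "Archie is in Paris")
small <- c("in Paris", "is great")

# Hypothetical chunk size; tune it on a sample of the real data.
chunk_size <- 1000L

# Split the phrases into groups of at most chunk_size elements.
chunks <- split(small, ceiling(seq_along(small) / chunk_size))

keep <- logical(length(big))
for (chunk in chunks) {
  # \Q...\E quotes each phrase literally; "|" joins them as alternatives.
  pattern <- paste0("\\Q", chunk, "\\E", collapse = "|")
  keep <- keep | grepl(pattern, big, perl = TRUE)
}

big[keep]
#[1] "Sally is great"     "John is great"      "Jane is in Paris"
#[4] "Archie is in Paris"
```

Each chunk costs one pass over the sentences, so larger chunks mean fewer passes but bigger patterns. Which setting is fastest depends on the data.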
Data used in the stri_detect example above:
big <- c("Sally is great",
"John is great",
"Sally likes peas",
"John likes onions",
"Jane is in Paris",
"Archie is in Paris")
small <- c("in Paris",
"is great")