Question

Suppose I have the following character vectors as input:

text_input <- c("ADOPT", "A", "FAIL", "FAST")
test <- c("TEST", "INPUT", "FAIL", "FAST")

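I want to match the two inputs and extract the words in text_input that are common to both. I am looking for something similar to str_extract.

I know that str_extract does this with a pattern or word to match, but my test data contains about 500,000 words. Any input would be really helpful.

Expected result: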

"FAIL", "FAST"

Edit

Just adding a follow-up question here... what happens when the input is a plain string, such as the one below?

text_input <- c("‘Data Scientist’ has been named the sexiest job of the 21st century by Harvard Business Review. The same article tells us that “demand has raced ahead of supply” and that the lack of data scientists “is becoming a serious constraint in some sectors.” A 2011 study by McKinsey Global Institute found that “there will be a shortage of talent necessary for organizations to take advantage of big data” – a shortage to the tune of 140,000 to 190,000 in the United States alone by 2018.")

test <- c("Data Scientist", "McKinsey", "ORGANIZATIONS", "FAST")

Can the string matching described above be performed in this case as well?

Note: the input and test strings have been changed.

Answer 1

If we need to extract characters:

library(stringr)
str_extract(text_input, paste0("[", test, "]+"))

If we are looking for exact string matches:

library(data.table)
fintersect(data.table(col1 = text_input), data.table(col1 = test))
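
To make the difference between the two approaches concrete, here is a small base R sketch of what a character-class pattern such as "[TEST]+" does. It is only an illustration: it uses regexpr()/regmatches() in place of str_extract() (assuming both behave the same for this simple pattern), and the helper name first_match is made up for this example.

```r
text_input <- c("ADOPT", "A", "FAIL", "FAST")
test <- c("TEST", "INPUT", "FAIL", "FAST")

# "[TEST]+" is a character class: it matches a run of the letters T, E, S,
# not the word "TEST" itself.
patterns <- paste0("[", test, "]+")

# Pairwise: element i of text_input is matched against pattern i.
# first_match is a hypothetical name used only in this sketch.
first_match <- mapply(function(txt, pat) {
  m <- regmatches(txt, regexpr(pat, txt))
  if (length(m) == 0) NA_character_ else m   # NA when nothing matches
}, text_input, patterns, USE.NAMES = FALSE)

print(first_match)
```

Note that "ADOPT" yields "T" here even though "TEST" is not a common word, so the character-class approach is not a substitute for exact matching.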

Answer 2

For the simple example, you can use intersect(), as mentioned in the comments.

text_input1 <- c("ADOPT", "A", "FAIL", "FAST")
test1 <- c("TEST", "INPUT", "FAIL", "FAST")
intersect(text_input1, test1)
# [1] "FAIL" "FAST"
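
If duplicates in the input matter, a possible variation is to subset with %in% instead. The sketch below uses a made-up vector, words_dup, that repeats "FAIL" to show the difference; with around 500,000 test words both calls rely on hashing-based matching in base R, so they should remain practical, although that is worth timing on the real data.

```r
# words_dup is a hypothetical input with a repeated word.
words_dup <- c("ADOPT", "A", "FAIL", "FAST", "FAIL")
test1 <- c("TEST", "INPUT", "FAIL", "FAST")

# intersect() returns each common word once.
common_unique <- intersect(words_dup, test1)

# %in% keeps the order and the duplicates of the input vector.
common_all <- words_dup[words_dup %in% test1]

print(common_unique)
print(common_all)
```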

The long example is a bit more involved.

text_input2 <- c("‘Data Scientist’ has been named the sexiest job of the 21st century by Harvard Business Review. The same article tells us that “demand has raced ahead of supply” and that the lack of data scientists “is becoming a serious constraint in some sectors.” A 2011 study by McKinsey Global Institute found that “there will be a shortage of talent necessary for organizations to take advantage of big data” – a shortage to the tune of 140,000 to 190,000 in the United States alone by 2018.")

phrases <- c("Data Scientist", "McKinsey", "ORGANIZATIONS", "FAST")

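The vector of test strings you defined, which I call phrases, contains compounds of two (or possibly more) words, i.e. they contain spaces. So we need a regular expression, rx1, that can handle them. It is also unclear whether the match should be case-sensitive; if it should not be, you need to apply tolower() to both the phrases and the text. Next, we test whether there is a match. If there is, we expand the regular expression to rx2 so that it works well with the replacement feature of gsub(). We Vectorize() the function so that it can handle a vector of phrases.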

matchPhrase <- Vectorize(function(phr, txt, tol=FALSE) {
  rx1 <- gsub(" ", "\\\\s", phr)         # handle spaces
  if (tol) {                             # optional tolower
    rx1 <- tolower(rx1)
    txt <- tolower(txt)
  }
  if (regexpr(rx1, txt) > 0) {           # test for matches
    rx2 <- paste0(".*(", rx1, ").*")
    return(gsub(rx2, "\\1", txt))        # gsub extraction
  } else {
    return(NA)                           # we want NA for no matches
  }
})

By default (tol=FALSE), the match is case-sensitive.

matchPhrase(phrases, text_input2, tol=FALSE)
#   Data Scientist         McKinsey    ORGANIZATIONS             FAST 
# "Data Scientist"       "McKinsey"               NA               NA

The case-insensitive version also finds "organizations".

matchPhrase(phrases, text_input2, tol=TRUE)
#   Data Scientist         McKinsey    ORGANIZATIONS             FAST 
# "data scientist"       "mckinsey"  "organizations"               NA

For a clean output, just do:

as.character(na.omit(matchPhrase(phrases, text_input2, tol=TRUE)))
# [1] "data scientist" "mckinsey"       "organizations"

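Note: you will probably need to adapt the function several times to your specific needs and desired output. In practice, the quanteda package is very sophisticated at this kind of task.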
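
As one possible adaptation of matchPhrase(), the sketch below matches the phrases literally (fixed = TRUE, so regex metacharacters in a phrase need no escaping) and returns the hit in the text's original case rather than lowercased. The function name extract_phrases, the argument ignore_case, and the short ASCII sample sentence are all assumptions made for this illustration, not part of the question.

```r
extract_phrases <- function(phrases, txt, ignore_case = FALSE) {
  # Lowercase both sides only to locate the match.
  needle   <- if (ignore_case) tolower(phrases) else phrases
  haystack <- if (ignore_case) tolower(txt) else txt
  out <- vapply(needle, function(p) {
    pos   <- regexpr(p, haystack, fixed = TRUE)   # literal search
    start <- as.integer(pos)
    len   <- attr(pos, "match.length")
    # Cut the hit out of the ORIGINAL text so its case is preserved.
    if (start > 0) substr(txt, start, start + len - 1L) else NA_character_
  }, character(1), USE.NAMES = FALSE)
  names(out) <- phrases
  out
}

# Hypothetical, shortened sample text.
txt <- "A Data Scientist at McKinsey helps organizations use big data."
phrases <- c("Data Scientist", "McKinsey", "ORGANIZATIONS", "FAST")

res_cs <- extract_phrases(phrases, txt)                      # case-sensitive
res_ci <- extract_phrases(phrases, txt, ignore_case = TRUE)  # case-insensitive
print(res_ci)
```

The position arithmetic assumes that tolower() does not change the number of characters, which holds for ASCII text like the sample above.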

Answer 3

This can also be done with the fuzzyjoin package, which includes a way to join data frames based on regular expressions.

text_input <- c("ADOPT", "A", "FAIL", "FAST")
regex <- c("TEST", "INPUT", "FAIL", "FAST")

library(fuzzyjoin)
library(dplyr)

df <- tibble( text = text_input )
df.regex <- tibble( regex_name = regex )

# now we can regex match them
df %>%
  regex_left_join( df.regex, by = c( text = "regex_name" ) )

# # A tibble: 4 x 2
# text  regex_name
#   <chr> <chr>     
# 1 ADOPT NA        
# 2 A     NA        
# 3 FAIL  FAIL      
# 4 FAST  FAST 

# or only regex 'hits'
df %>%
  regex_inner_join( df.regex, by = c( text = "regex_name" ) )

# # A tibble: 2 x 2
# text  regex_name
#   <chr> <chr>     
# 1 FAIL  FAIL      
# 2 FAST  FAST

Match two strings and extract the matching characters in R
