How to loop through a list of keyword vectors and fuzzy-match them against another file (R)

Time: 2018-10-25 15:33:47

Tags: r loops matching sapply grepl

I have two files, one full of keywords (about 2,000 rows) and the other full of text (about 770,000 rows). The keyword file looks like this:

Event Name            Keyword
All-day tabby fest    tabby, all-day
All-day tabby fest    tabby, fest
Maine Coon Grooming   maine coon, groom    
Maine Coon Grooming   coon, groom

keywordFile <- tibble(EventName = c("All-day tabby fest", "All-day tabby fest", "Maine Coon Grooming","Maine Coon Grooming"), Keyword = c("tabby, all-day", "tabby, fest", "maine coon, groom", "coon, groom"))

The text file looks like this:

Description
Bring your tabby to the fest on Tuesday
All cats are welcome to the fest on Tuesday
Mainecoon grooming will happen at noon Wednesday
Maine coons will be pampered at noon on Wednesday

text <- tibble(Description = c("Bring your tabby to the fest on Tuesday","All cats are welcome to the fest on Tuesday","Mainecoon grooming will happen at noon Wednesday","Maine coons will be pampered at noon on Wednesday"))

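What I want is to loop through the text file, look for fuzzy matches (a description must contain every word in the Keyword column entry), and return a new column showing TRUE or FALSE. If it is TRUE, I want a third column to show the event name. It should look like this: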

Description                                          Match?   Event Name
Bring your tabby to the fest on Tuesday              TRUE     All-day tabby fest
All cats are welcome to the fest on Tuesday          FALSE
Mainecoon grooming will happen at noon Wednesday     TRUE     Maine Coon Grooming
Maine coons will be pampered at noon on Wednesday    FALSE

With help from Molx (How can I check if multiple strings exist in another string?), I was able to fuzzy match successfully using something like this (after converting everything to lower case):

str <- c("tabby", "all-day")
myStr <- "Bring your tabby to the fest on Tuesday"
all(sapply(str, grepl, myStr))

However, I get stuck when I try to fuzzy match across the whole file. I have tried something like this:

for (i in seq_along(text$Description)){
  for (j in seq_along(keywordFile$EventName)) {
    # below I am creating the TRUE/FALSE column
    text$TF[i] <- all(sapply(keywordFile$Keyword[j], grepl, 
                                                     text$Description[i]))
    if (isTRUE(text$TF))
      # below I am creating the EventName column
      text$EventName <- keywordFile$EventName
    }
}

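I don't think I'm having trouble converting the right things to vectors and strings. My keywordFile$Keyword column is a set of string vectors, and my text$Description column is a character string. But I'm struggling with how to loop through both files properly. The error I get is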

Error in ... replacement has 13 rows, data has 1

Has anyone done something like this before?

1 answer:

Answer 0: (score: 2)

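I'm not sure I understand your question correctly, because I wouldn't call grepl() fuzzy matching. Rather, it catches a keyword even when that keyword sits inside a longer word. So "cat" and "catastrophe" would be a match, even though the words are very different.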
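To make the substring behaviour concrete, here is a minimal base R sketch. The example strings and variable names are illustrative only; the word-boundary pattern is one possible way to require a whole-word match, not something the question asked for.

```r
# grepl() looks for the pattern anywhere in the string,
# so a short keyword also hits inside a longer word.
substring_hit <- grepl("cat", "catastrophe")

# Adding word boundaries (\\b) restricts the match to the whole word.
whole_word_in_long_word <- grepl("\\bcat\\b", "catastrophe", perl = TRUE)
whole_word_in_sentence  <- grepl("\\bcat\\b", "my cat sleeps", perl = TRUE)

print(c(substring_hit, whole_word_in_long_word, whole_word_in_sentence))
```

The same substring behaviour is what lets "coon" match inside "Mainecoon" in the desired output shown in the question.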

I chose to write an answer that instead lets you control the distance between strings that still counts as a match:

Load the libraries:

library(tibble)
library(dplyr)
library(fuzzyjoin)
library(tidytext)
library(tidyr)

Create the dictionary and data objects:

dict <- tibble(Event_Name = c(
  "All-day tabby fest",
  "All-day tabby fest",
  "Maine Coon Grooming",
  "Maine Coon Grooming"
), Keyword = c(
  "tabby, all-day",
  "tabby, fest",
  "maine coon, groom",
  "coon, groom"
)) %>% 
  mutate(Keyword = strsplit(Keyword, ", ")) %>% 
  unnest(Keyword)

string <- tibble(id = 1:4, Description = c(
  "Bring your tabby to the fest on Tuesday",
  "All cats are welcome to the fest on Tuesday",
  "Mainecoon grooming will happen at noon Wednesday",
  "Maine coons will be pampered at noon on Wednesday"
))
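For readers unfamiliar with the strsplit() and unnest() step, the following base R sketch shows the same reshaping on two dictionary rows: each comma-separated keyword entry becomes one row per keyword. The names event, keyword, parts and dict_long are made up for this illustration.

```r
# Two rows from the dictionary above, one keyword string per event row.
event   <- c("All-day tabby fest", "Maine Coon Grooming")
keyword <- c("tabby, all-day", "coon, groom")

# Split every entry on ", " to get a list of character vectors.
parts <- strsplit(keyword, ", ", fixed = TRUE)

# Repeat each event name once per keyword and flatten the list:
# this is the "long" layout that unnest() produces.
dict_long <- data.frame(
  Event_Name = rep(event, times = lengths(parts)),
  Keyword    = unlist(parts),
  stringsAsFactors = FALSE
)
print(dict_long)
```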

Apply the dictionary to the data:

string_annotated <- string %>% 
  unnest_tokens(output = "word", input = Description) %>%
  stringdist_left_join(y = dict, by = c("word" = "Keyword"), max_dist = 1) %>% 
  mutate(match = !is.na(Keyword))

> string_annotated
# A tibble: 34 x 5
      id word    Event_Name         Keyword match
   <int> <chr>   <chr>              <chr>   <lgl>
 1     1 bring   NA                 NA      FALSE
 2     1 your    NA                 NA      FALSE
 3     1 tabby   All-day tabby fest tabby   TRUE 
 4     1 tabby   All-day tabby fest tabby   TRUE 
 5     1 to      NA                 NA      FALSE
 6     1 the     NA                 NA      FALSE
 7     1 fest    All-day tabby fest fest    TRUE 
 8     1 on      NA                 NA      FALSE
 9     1 tuesday NA                 NA      FALSE
10     2 all     NA                 NA      FALSE
# ... with 24 more rows

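max_dist controls what still counts as a match. In this case, a string distance of 1 or less finds matches in all of the texts, but I also tried it with a string that does not match.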
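As a rough illustration of what a distance threshold of 1 means, the sketch below computes edit distances with base R's adist() for a few token/keyword pairs from the example data. This is a stand-in: the join above may use a slightly different distance metric by default, and the names pairs, dist and within_1 are made up here.

```r
# Token/keyword pairs taken from the example data.
pairs <- data.frame(
  word    = c("fest", "coons", "mainecoon", "grooming"),
  keyword = c("fest", "coon", "maine coon", "groom"),
  stringsAsFactors = FALSE
)

# adist() returns a matrix of edit distances; compare the pairs one by one.
pairs$dist <- mapply(
  function(w, k) adist(w, k)[1, 1],
  pairs$word, pairs$keyword,
  USE.NAMES = FALSE
)

# Same cut-off as max_dist = 1.
pairs$within_1 <- pairs$dist <= 1
print(pairs)
```

Under this measure "grooming" is three edits away from "groom", which is consistent with "groom" not showing up among the matched keywords for the third description.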

If you want to bring this long format back to the original format:

string_annotated_col <- string_annotated %>% 
  group_by(id) %>% 
  summarise(Description = paste(word, collapse = " "),
            match = sum(match),
            keywords = toString(unique(na.omit(Keyword))),
            Event_Name = toString(unique(na.omit(Event_Name))))

> string_annotated_col
# A tibble: 4 x 5
     id Description                                       match keywords         Event_Name         
  <int> <chr>                                             <int> <chr>            <chr>              
1     1 bring your tabby tabby to the fest on tuesday         3 tabby, fest      All-day tabby fest 
2     2 all cats are welcome to the fest on tuesday           1 fest             All-day tabby fest 
3     3 mainecoon grooming will happen at noon wednesday      2 maine coon, coon Maine Coon Grooming
4     4 maine coons will be pampered at noon on wednesday     2 coon             Maine Coon Grooming
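The collapsed table counts individual keyword hits, whereas the question asks for a TRUE/FALSE column that is TRUE only when every keyword in one dictionary entry occurs in the description. A minimal base R sketch of that stricter rule follows. It uses plain substring matching with grepl() rather than string distance, picks the first matching event, and the helper first_matching_event() is hypothetical, not part of any package.

```r
keywordFile <- data.frame(
  EventName = c("All-day tabby fest", "All-day tabby fest",
                "Maine Coon Grooming", "Maine Coon Grooming"),
  Keyword = c("tabby, all-day", "tabby, fest", "maine coon, groom", "coon, groom"),
  stringsAsFactors = FALSE
)
text <- data.frame(
  Description = c("Bring your tabby to the fest on Tuesday",
                  "All cats are welcome to the fest on Tuesday",
                  "Mainecoon grooming will happen at noon Wednesday",
                  "Maine coons will be pampered at noon on Wednesday"),
  stringsAsFactors = FALSE
)

# Split each keyword entry into its individual lower-case terms.
keyword_terms <- strsplit(tolower(keywordFile$Keyword), ",\\s*")

# Hypothetical helper: name of the first event whose terms ALL occur
# in the description (as substrings), or NA if no entry qualifies.
first_matching_event <- function(description) {
  desc <- tolower(description)
  hits <- vapply(
    keyword_terms,
    function(terms) all(vapply(terms, grepl, logical(1), x = desc, fixed = TRUE)),
    logical(1)
  )
  if (any(hits)) keywordFile$EventName[which(hits)[1]] else NA_character_
}

# One result per description, so each new column has as many rows as the data.
text$EventName <- vapply(text$Description, first_matching_event,
                         character(1), USE.NAMES = FALSE)
text$Match <- !is.na(text$EventName)
print(text)
```

Because every description is compared with every dictionary entry, this may be slow at the file sizes mentioned in the question.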

If any part of the answer doesn't make sense to you, feel free to ask. Some of it is explained here, except for the fuzzy-matching part.