I have two files: one full of keywords (about 2,000 rows) and one full of text (about 770,000 rows). The keyword file looks like this:
Event Name             Keyword
All-day tabby fest     tabby, all-day
All-day tabby fest     tabby, fest
Maine Coon Grooming    maine coon, groom
Maine Coon Grooming    coon, groom
keywordFile <- tibble(EventName = c("All-day tabby fest", "All-day tabby fest", "Maine Coon Grooming", "Maine Coon Grooming"), Keyword = c("tabby, all-day", "tabby, fest", "maine coon, groom", "coon, groom"))
The text file looks like this:
Description
Bring your tabby to the fest on Tuesday
All cats are welcome to the fest on Tuesday
Mainecoon grooming will happen at noon Wednesday
Maine coons will be pampered at noon on Wednesday
text <- tibble(Description = c("Bring your tabby to the fest on Tuesday", "All cats are welcome to the fest on Tuesday", "Mainecoon grooming will happen at noon Wednesday", "Maine coons will be pampered at noon on Wednesday"))
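What I want is to iterate through the text file, look for fuzzy matches (a description must contain every word in the Keyword column entry), and return a new column that shows TRUE or FALSE. If it is TRUE, I want a third column to show the event name. It would look like this: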
Description                                          Match?   Event Name
Bring your tabby to the fest on Tuesday              TRUE     All-day tabby fest
All cats are welcome to the fest on Tuesday          FALSE
Mainecoon grooming will happen at noon Wednesday     TRUE     Maine Coon Grooming
Maine coons will be pampered at noon on Wednesday    FALSE
With help from Molx (How can I check if multiple strings exist in another string?), I was able to do the fuzzy matching successfully with something like this (after converting everything to lowercase):
str <- c("tabby", "all-day")
myStr <- "Bring your tabby to the fest on Tuesday"
all(sapply(str, grepl, myStr))
However, I get stuck when I try to fuzzy match across the whole file. I have tried something like this:
for (i in seq_along(text$Description)){
  for (j in seq_along(keywordFile$EventName)) {
    # below I am creating the TRUE/FALSE column
    text$TF[i] <- all(sapply(keywordFile$Keyword[j], grepl,
                             text$Description[i]))
    if (isTRUE(text$TF))
      # below I am creating the EventName column
      text$EventName <- keywordFile$EventName
  }
}
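I don't think I have trouble converting the right things to vectors and strings. My keywordFile$Keyword column holds vectors of strings, and my text$Description column is a string. But I am struggling with how to iterate over both files properly. The error I get is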
Error in ... replacement has 13 rows, data has 1
Has anyone done something like this before?
Answer 0 (score: 2)
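I am not sure I understood your question correctly, because I would not call grepl() fuzzy matching. Rather, it also captures a keyword when it sits inside a longer word. So "cat" and "catastrophe" would be a match even though the words are very different.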
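To make the point concrete, here is a minimal sketch of the substring behaviour. The variable names are only illustrative, and the word-boundary pattern is one possible way to demand a whole-word hit, not something the question itself uses.

```r
# grepl() does substring matching: "cat" is found inside "catastrophe"
substring_hit <- grepl("cat", "catastrophe", fixed = TRUE)

# Requiring word boundaries (\\b) rejects the longer word ...
whole_word_hit <- grepl("\\bcat\\b", "catastrophe", perl = TRUE)

# ... but still accepts "cat" when it stands alone
standalone_hit <- grepl("\\bcat\\b", "my cat sleeps", perl = TRUE)

print(c(substring = substring_hit,
        whole_word = whole_word_hit,
        standalone = standalone_hit))
```

Neither variant tolerates spelling differences, which is why the rest of this answer switches to a string distance.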
I chose to write an answer in which you can control how far apart two strings may be and still count as a match:
Load the libraries:
library(tibble)
library(dplyr)
library(fuzzyjoin)
library(tidytext)
library(tidyr)
Create the dictionary and data objects:
dict <- tibble(
  Event_Name = c(
    "All-day tabby fest",
    "All-day tabby fest",
    "Maine Coon Grooming",
    "Maine Coon Grooming"
  ),
  Keyword = c(
    "tabby, all-day",
    "tabby, fest",
    "maine coon, groom",
    "coon, groom"
  )
) %>%
  # split each comma-separated keyword set into one keyword per row
  mutate(Keyword = strsplit(Keyword, ", ")) %>%
  unnest(Keyword)
string <- tibble(
  id = 1:4,
  Description = c(
    "Bring your tabby to the fest on Tuesday",
    "All cats are welcome to the fest on Tuesday",
    "Mainecoon grooming will happen at noon Wednesday",
    "Maine coons will be pampered at noon on Wednesday"
  )
)
Apply the dictionary to the data:
string_annotated <- string %>%
  # one lower-cased word per row
  unnest_tokens(output = "word", input = Description) %>%
  # join each word to every keyword within string distance 1
  stringdist_left_join(y = dict, by = c("word" = "Keyword"), max_dist = 1) %>%
  mutate(match = !is.na(Keyword))
> string_annotated
# A tibble: 34 x 5
      id word    Event_Name         Keyword match
   <int> <chr>   <chr>              <chr>   <lgl>
 1     1 bring   NA                 NA      FALSE
 2     1 your    NA                 NA      FALSE
 3     1 tabby   All-day tabby fest tabby   TRUE
 4     1 tabby   All-day tabby fest tabby   TRUE
 5     1 to      NA                 NA      FALSE
 6     1 the     NA                 NA      FALSE
 7     1 fest    All-day tabby fest fest    TRUE
 8     1 on      NA                 NA      FALSE
 9     1 tuesday NA                 NA      FALSE
10     2 all     NA                 NA      FALSE
# ... with 24 more rows
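max_dist controls what still counts as a match. In this case, a string distance of 1 or less finds matches in all of the texts, but I also tried it with a string that does not match.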
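To get a feel for what a distance of 1 means, the following sketch computes edit distances with base R's adist(). This is a stand-in and an assumption on my part: as far as I know, fuzzyjoin delegates to the stringdist package (default method "osa"), which agrees with the Levenshtein distance from adist() whenever no adjacent characters are swapped, as is the case for the sample words here. The vectors `words` and `keywords` are made up for illustration.

```r
# A few tokens from the descriptions and a few dictionary keywords
words    <- c("coons", "grooming", "mainecoon", "cats")
keywords <- c("coon", "groom", "tabby")

# adist() returns a matrix of Levenshtein distances (rows: words, cols: keywords)
d <- adist(words, keywords)
dimnames(d) <- list(words, keywords)
print(d)

# TRUE wherever a word would be joined to a keyword with max_dist = 1
within_one <- d <= 1
print(within_one)
```

"coons" is one edit away from "coon" and therefore matches, while "grooming" is three edits away from "groom" and would only match with max_dist = 3 or more. Raising max_dist that far also lets unrelated short words through, so it is worth inspecting the joined rows before settling on a value.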
If you want to bring this long format back to the original layout:
string_annotated_col <- string_annotated %>%
  group_by(id) %>%
  summarise(Description = paste(word, collapse = " "),
            match = sum(match),
            keywords = toString(unique(na.omit(Keyword))),
            Event_Name = toString(unique(na.omit(Event_Name))))
> string_annotated_col
# A tibble: 4 x 5
     id Description                                       match keywords         Event_Name
  <int> <chr>                                             <int> <chr>            <chr>
1     1 bring your tabby tabby to the fest on tuesday         3 tabby, fest      All-day tabby fest
2     2 all cats are welcome to the fest on tuesday           1 fest             All-day tabby fest
3     3 mainecoon grooming will happen at noon wednesday      2 maine coon, coon Maine Coon Grooming
4     4 maine coons will be pampered at noon on wednesday     2 coon             Maine Coon Grooming
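Note that this summary attaches an event as soon as a single keyword is close to one word, whereas the question asks for every term of a keyword set to be present. A minimal base R sketch of that stricter rule follows. It uses plain substring matching as in the question rather than string distance, and the helper `first_event` and the `Match` column name are hypothetical names chosen for illustration.

```r
keywordFile <- data.frame(
  EventName = c("All-day tabby fest", "All-day tabby fest",
                "Maine Coon Grooming", "Maine Coon Grooming"),
  Keyword   = c("tabby, all-day", "tabby, fest",
                "maine coon, groom", "coon, groom"),
  stringsAsFactors = FALSE
)
text <- data.frame(
  Description = c("Bring your tabby to the fest on Tuesday",
                  "All cats are welcome to the fest on Tuesday",
                  "Mainecoon grooming will happen at noon Wednesday",
                  "Maine coons will be pampered at noon on Wednesday"),
  stringsAsFactors = FALSE
)

# Split every keyword set into its individual lower-case terms
keyword_sets <- strsplit(tolower(keywordFile$Keyword), ", ", fixed = TRUE)

# Hypothetical helper: name of the first event whose terms ALL occur
# as substrings of the description, or NA if no keyword set qualifies
first_event <- function(description) {
  description <- tolower(description)
  hit <- vapply(
    keyword_sets,
    function(terms) {
      all(vapply(terms, grepl, logical(1), x = description, fixed = TRUE))
    },
    logical(1)
  )
  if (any(hit)) keywordFile$EventName[which(hit)[1]] else NA_character_
}

text$EventName <- vapply(text$Description, first_event, character(1),
                         USE.NAMES = FALSE)
text$Match <- !is.na(text$EventName)
print(text[, c("Description", "Match", "EventName")])
```

On the four sample rows this gives TRUE, FALSE, TRUE, FALSE, as requested in the question. With roughly 770,000 descriptions and 2,000 keyword sets, the nested loops imply a very large number of grepl() calls, so treat this as an illustration of the logic rather than a tuned solution.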
If part of the answer does not make sense to you, feel free to ask. Some of it is explained here, except for the fuzzy-matching part.