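I'm trying to detect matches between open text fields (read: messy!) and a vector of names. I've created a silly fruit example that highlights my main challenges.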
df1 <- data.frame(id = c(1, 2, 3, 4, 5, 6),
entry = c("Apple",
"I love apples",
"appls",
"Bannanas",
"banana",
"An apple a day keeps..."))
df1$entry <- as.character(df1$entry)
df2 <- data.frame(fruit=c("apple",
"banana",
"pineapple"),
code=c(11, 12, 13))
df2$fruit <- as.character(df2$fruit)
library(dplyr)
library(stringr)
df1 %>%
  mutate(match = str_detect(str_to_lower(entry),
                            str_to_lower(df2$fruit)))
My approach grabs the low-hanging fruit, if you will (the exact matches for "Apple" and "banana").
# id entry match
#1 1 Apple TRUE
#2 2 I love apples FALSE
#3 3 appls FALSE
#4 4 Bannanas FALSE
#5 5 banana TRUE
#6 6 An apple a day keeps... FALSE
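A side note on why rows 2 and 6 come out FALSE even though they contain "apple": str_detect() is vectorised over both the string and the pattern, so each entry is tested against a single fruit rather than against all three. The sketch below checks every entry against every fruit instead. It is a minimal illustration in base R (grepl with fixed = TRUE), and the names entries, fruits, hits and any_match are mine, not part of the original code.

```r
entries <- c("Apple", "I love apples", "appls",
             "Bannanas", "banana", "An apple a day keeps...")
fruits <- c("apple", "banana", "pineapple")

# One column per fruit: TRUE where that fruit occurs as a literal substring
# of the lower-cased entry.
hits <- sapply(fruits, function(f) grepl(f, tolower(entries), fixed = TRUE))

# An entry counts as matched if any fruit was found in it.
any_match <- rowSums(hits) > 0
data.frame(entry = entries, match = any_match)
```

This picks up the embedded cases (rows 2 and 6) but still misses the misspellings in rows 3 and 4, which need a fuzzy comparison.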
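The unmatched cases pose different challenges: entries 2 and 6 embed the fruit in a longer phrase, while entries 3 and 4 are misspelled.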
The fuzzywuzzyR package is great and does a pretty good job (see its page for details on installing the Python module).
library(fuzzywuzzyR)
choices <- df2$fruit
word <- df1$entry[3] # "appls"
init_proc = FuzzUtils$new()
PROC = init_proc$Full_process
PROC1 = tolower
init_scor = FuzzMatcher$new()
SCOR = init_scor$WRATIO
init <- FuzzExtract$new()
init$Extract(string = word,
sequence_strings = choices,
processor = PROC,
scorer = SCOR)
This setup returns a score of 80 for "apple" (the highest).
Are there other approaches to consider besides fuzzywuzzyR? How would you tackle this problem?
Adding the fuzzywuzzyR output:
[[1]]
[[1]][[1]]
[1] "apple"
[[1]][[2]]
[1] 80
[[2]]
[[2]][[1]]
[1] "pineapple"
[[2]][[2]]
[1] 72
[[3]]
[[3]][[1]]
[1] "banana"
[[3]][[2]]
[1] 18
Answer 0 (score: 2)
I came across this question today while answering another one, so I thought I'd answer the original question.
library(dplyr)
library(fuzzyjoin)
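df1 %>%
  # Jaro-Winkler distance between every entry and every fruit
  stringdist_left_join(df2, by = c(entry = "fruit"), ignore_case = TRUE,
                       method = "jw", distance_col = "dist") %>%
  group_by(entry) %>%
  # keep the closest fruit for each entry
  top_n(-1, dist) %>%
  select(-dist)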
The output is:
id entry fruit code
<dbl> <fct> <fct> <dbl>
1 1.00 Apple apple 11.0
2 2.00 I love apples pineapple 13.0
3 3.00 appls apple 11.0
4 4.00 Bannanas banana 12.0
5 5.00 banana banana 12.0
6 6.00 An apple a day keeps... apple 11.0
Sample data:
df1 <- data.frame(id = c(1, 2, 3, 4, 5, 6),
entry = c("Apple", "I love apples", "appls", "Bannanas", "banana", "An apple a day keeps..."))
df2 <- data.frame(fruit=c("apple", "banana", "pineapple"), code=c(11, 12, 13))
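For comparison, here is a dependency-free sketch of the same "closest fruit per entry" idea. Note that it swaps in a different technique: base R's adist() computes a Levenshtein edit distance, not the Jaro-Winkler distance used above, and partial = TRUE scores each fruit against the best-matching substring of the entry. The names entries, fruits, codes, best and matched are introduced here for illustration only.

```r
entries <- c("Apple", "I love apples", "appls",
             "Bannanas", "banana", "An apple a day keeps...")
fruits <- c("apple", "banana", "pineapple")
codes  <- c(11, 12, 13)

# Rows = fruits, columns = entries. With partial = TRUE the distance is the
# number of edits needed to turn the fruit into some substring of the entry.
d <- adist(fruits, entries, partial = TRUE, ignore.case = TRUE)

# For each entry (column), take the index of the fruit with the smallest distance.
best <- apply(d, 2, which.min)
matched <- fruits[best]

data.frame(entry = entries,
           fruit = matched,
           code  = codes[best],
           dist  = d[cbind(best, seq_along(entries))])
```

Because the comparison is against substrings, "I love apples" is assigned to apple here (distance 0) rather than pineapple. Every entry still gets some fruit, so in real data you would probably want a cutoff on dist to reject poor matches; what threshold is sensible depends on your text.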