Fuzzy matching in R

Time: 2017-11-13 19:04:37

Tags: r fuzzywuzzy

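I am trying to detect matches between an open text field (read: messy!) and a vector of names. I created a silly fruit example that highlights my main challenges.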

df1 <- data.frame(id = c(1, 2, 3, 4, 5, 6),
              entry = c("Apple", 
                        "I love apples", 
                        "appls",
                        "Bannanas",
                        "banana", 
                        "An apple a day keeps..."))
df1$entry <- as.character(df1$entry)

df2 <- data.frame(fruit=c("apple",
                          "banana",
                          "pineapple"),
                  code=c(11, 12, 13))
df2$fruit <- as.character(df2$fruit)

library(dplyr)    # %>% and mutate()
library(stringr)  # str_detect() and str_to_lower()

df1 %>%
  mutate(match = str_detect(str_to_lower(entry), 
                            str_to_lower(df2$fruit)))

My approach grabs the low-hanging fruit, if you will (the exact matches for "Apple" and "banana").

#  id                   entry match
#1  1                   Apple  TRUE
#2  2           I love apples FALSE
#3  3                   appls FALSE
#4  4                Bannanas FALSE
#5  5                  banana  TRUE
#6  6 An apple a day keeps... FALSE
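Part of the reason rows 2 and 6 come back FALSE appears to be that `str_detect()` is vectorised over both of its arguments, so each entry is compared with a single recycled element of `df2$fruit` rather than with all three names. The following is a minimal base R sketch, assuming plain substring matching is acceptable, that checks every entry against every fruit. The names `entries`, `fruits`, `hits` and `any_match` are illustrative and not part of the original code.

```r
# Assumed copies of the question's data (entry text and fruit names).
entries <- c("Apple", "I love apples", "appls", "Bannanas", "banana",
             "An apple a day keeps...")
fruits <- c("apple", "banana", "pineapple")

# One column per fruit, one row per entry: TRUE where the lower-cased
# entry contains the fruit name as a literal substring.
hits <- sapply(fruits, function(f) grepl(f, tolower(entries), fixed = TRUE))

# An entry counts as matched if it contains at least one fruit name.
any_match <- rowSums(hits) > 0
data.frame(entry = entries, match = any_match)
```

With this comparison the embedded "apple" in rows 2 and 6 is found, while "appls" and "Bannanas" still need a fuzzy comparison.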

The unmatched cases pose different challenges:

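  1. The target fruit in cases 2 and 6 is embedded in a larger string.
  2. The target fruit in cases 3 and 4 requires a fuzzy match.

The fuzzywuzzyR package is great and does a pretty good job (see the page for details on installing the Python module).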

    library(fuzzywuzzyR)
    choices <- df2$fruit
    word <- df1$entry[3]  # "appls"
    
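    init_proc = FuzzUtils$new()      # utilities class that provides string processors
    PROC = init_proc$Full_process    # processor used below to normalise the strings
    PROC1 = tolower                  # alternative processor (base R), not used below

    init_scor = FuzzMatcher$new()    # class that provides the scorers
    SCOR = init_scor$WRATIO          # weighted-ratio scorer

    init <- FuzzExtract$new()        # class that extracts matches from a set of choices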
    
    init$Extract(string = word, 
                 sequence_strings = choices, 
                 processor = PROC, 
                 scorer = SCOR)
    

This setup returns a score of 80 for "apple" (the highest).

Are there other approaches to consider besides fuzzywuzzyR? How would you tackle this problem?

Adding the fuzzywuzzyR output:

    [[1]]
    [[1]][[1]]
    [1] "apple"
    
    [[1]][[2]]
    [1] 80
    
    
    [[2]]
    [[2]][[1]]
    [1] "pineapple"
    
    [[2]][[2]]
    [1] 72
    
    
    [[3]]
    [[3]][[1]]
    [1] "banana"
    
    [[3]][[2]]
    [1] 18
    

1 answer:

Answer 0 (score: 2)

I came across this question today while answering another one, so I thought I would answer the original question.

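library(dplyr)
library(fuzzyjoin)

df1 %>%
  # Jaro-Winkler distance between each entry and each fruit
  stringdist_left_join(df2, by = c(entry = "fruit"), ignore_case = TRUE,
                       method = "jw", distance_col = "dist") %>%
  group_by(entry) %>%
  # keep the closest fruit for each entry
  top_n(-1, dist) %>%
  select(-dist)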

The output is:

     id entry                   fruit      code
  <dbl> <fct>                   <fct>     <dbl>
1  1.00 Apple                   apple      11.0
2  2.00 I love apples           pineapple  13.0
3  3.00 appls                   apple      11.0
4  4.00 Bannanas                banana     12.0
5  5.00 banana                  banana     12.0
6  6.00 An apple a day keeps... apple      11.0

Sample data:

df1 <- data.frame(id = c(1, 2, 3, 4, 5, 6),
                  entry = c("Apple", "I love apples", "appls", "Bannanas", "banana", "An apple a day keeps..."))
df2 <- data.frame(fruit=c("apple", "banana", "pineapple"), code=c(11, 12, 13))
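Note that row 2 ("I love apples") is paired with pineapple rather than apple in the output above, presumably because Jaro-Winkler compares the whole strings. As one possible alternative for entries where the fruit is embedded in a longer string, here is a minimal sketch using base R's `adist()` with `partial = TRUE`, which measures the edit distance between a pattern and its closest substring of the text. This is a different technique from the fuzzyjoin approach, not a drop-in replacement; the names `entries`, `lookup`, `d`, `best` and `result` are illustrative.

```r
# Assumed copies of the sample data, kept as plain character vectors.
entries <- c("Apple", "I love apples", "appls", "Bannanas", "banana",
             "An apple a day keeps...")
lookup <- data.frame(fruit = c("apple", "banana", "pineapple"),
                     code = c(11, 12, 13),
                     stringsAsFactors = FALSE)

# Rows = fruits, columns = entries. With partial = TRUE each value is the
# edit distance between the fruit and its closest substring of the entry.
d <- adist(lookup$fruit, entries, partial = TRUE, ignore.case = TRUE)

# Index of the closest fruit for each entry (the first one wins on ties).
best <- apply(d, 2, which.min)

result <- data.frame(entry = entries,
                     fruit = lookup$fruit[best],
                     code  = lookup$code[best],
                     dist  = d[cbind(best, seq_along(entries))],
                     stringsAsFactors = FALSE)
result
```

Because every entry is assigned its nearest fruit, real data would likely also need a cut-off on `dist` so that entries mentioning no fruit at all are left unmatched.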