Fuzzy matching of strings in a data frame

Time: 2018-04-08 06:33:22

Tags: r fuzzy-logic stringdist record-linkage

I have a data frame containing article titles and their associated URL links.

My problem is that the URL links are not necessarily in the row of the corresponding title, for example:

    title                             | urls
    Who will be the next president?   | https://website/5-ways-to-make-a-cocktail.com
    5 ways to make a cocktail         | https://website/who-will-be-the-next-president.com
    2 millions raised by this startup | https://website/how-did-you-find-your-house.com
    How did you find your house       | https://website/2-millions-raised-by-this-startup.com
    How did you find your house       | https://washingtonpost/article/latest-movies-in-theater.com
    Latest movies in Theater          | www.newspaper/mynews/what-to-cook-in-summer.com
    What to cook in summer            | https://website/2-millions-raised-by-this-startup.com

My guess is that I need to look into some kind of fuzzy matching logic, but I'm not sure how. For duplicates, I'll use the unique function.

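I started with the levenshteinSim function from the RecordLinkage package, which gives a similarity score for each row, but since the rows are not aligned, the similarity score is obviously low everywhere.

I have also heard of the stringdistmatrix function from the stringdist package, but I don't know how to use it here.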

1 answer:

Answer 0: (score: 1)

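This can surely be optimized, but it might get you started:

  1. The function matcher() converts the last part of a URL into lower-case words, compares them with the words of a title, and yields a score
  2. Afterwards we try to match each title against every URL with matcher() and take the highest score
  3. If no score reaches the threshold, NA is assigned

In R: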

    matcher <- function(needle, haystack) {
      ### Analyzes the url part, converts them to lower case words
      ### and calculates a score to return
    
      # convert url
      y <- unlist(strsplit(haystack, '/'))
      y <- tolower(unlist(strsplit(y[length(y)], '[-.]')))
    
      # convert needle
      x <- needle
    
      # sum it up
      (z <- (sum(x %in% y) / length(x) + sum(y %in% x) / length(y)) / 2)
    }
    
    pairer <- function(titles, urls, threshold = 0.75) {
      ### Calculates scores for each title -> url combination
      result <- character(length(titles))
      for (i in seq_along(titles)) {
        needle <- tolower(unlist(strsplit(titles[i], ' ')))
        scores <- unlist(lapply(urls, function(url) matcher(needle, url)))
        high_score <- max(scores)
    
        # above threshold ?
        # which.max() returns the first url that reaches the highest score
        result[i] <- if (high_score >= threshold) urls[which.max(scores)] else NA
      }
      return(result)
    }
    
    df$guess <- pairer(df$title, df$urls)
    df
    

This yields

                                  title                                                        urls                                                       guess
    1   Who will be the next president?               https://website/5-ways-to-make-a-cocktail.com          https://website/who-will-be-the-next-president.com
    2         5 ways to make a cocktail          https://website/who-will-be-the-next-president.com               https://website/5-ways-to-make-a-cocktail.com
    3 2 millions raised by this startup             https://website/how-did-you-find-your-house.com       https://website/2-millions-raised-by-this-startup.com
    4       How did you find your house       https://website/2-millions-raised-by-this-startup.com             https://website/how-did-you-find-your-house.com
    5       How did you find your house https://washingtonpost/article/latest-movies-in-theater.com             https://website/how-did-you-find-your-house.com
    6          Latest movies in Theater             www.newspaper/mynews/what-to-cook-in-summer.com https://washingtonpost/article/latest-movies-in-theater.com
    7            What to cook in summer       https://website/2-millions-raised-by-this-startup.com             www.newspaper/mynews/what-to-cook-in-summer.com
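
Since the question mentions Levenshtein-style similarity, here is a rough sketch of how an edit-distance variant of the same idea might look. It is not part of the answer above. It uses base R's adist(), which computes a generalized Levenshtein distance between every pair of strings, instead of the stringdist package. The names normalize(), pair_by_distance() and max_rel_dist are hypothetical, and the code assumes that each URL carries the title in its last path segment and ends in ".com", as in the sample data.

```r
# Sketch only: normalize(), pair_by_distance() and max_rel_dist are
# hypothetical names, not part of the answer above.

# Lower-case a string and collapse every run of non-alphanumeric
# characters (spaces, hyphens, "?") into a single space.
normalize <- function(x) {
  trimws(gsub("[^a-z0-9]+", " ", tolower(x)))
}

pair_by_distance <- function(titles, urls, max_rel_dist = 0.25) {
  titles <- as.character(titles)
  urls <- as.character(urls)
  t_norm <- normalize(titles)
  # Assumption: the title lives in the last path segment and ends in ".com"
  u_norm <- normalize(sub("\\.com$", "", basename(urls)))

  d <- adist(t_norm, u_norm)                 # rows: titles, columns: urls
  len <- outer(nchar(t_norm), nchar(u_norm), pmax)
  len[len == 0] <- 1                         # avoid division by zero
  rel <- d / len                             # 0 = identical, 1 = nothing shared

  best <- unname(apply(rel, 1, which.min))   # closest url for each title
  ok <- rel[cbind(seq_along(best), best)] <= max_rel_dist
  ifelse(ok, urls[best], NA_character_)
}

# Sample data from the question
titles <- c("Who will be the next president?",
            "5 ways to make a cocktail",
            "2 millions raised by this startup",
            "How did you find your house",
            "How did you find your house",
            "Latest movies in Theater",
            "What to cook in summer")
urls <- c("https://website/5-ways-to-make-a-cocktail.com",
          "https://website/who-will-be-the-next-president.com",
          "https://website/how-did-you-find-your-house.com",
          "https://website/2-millions-raised-by-this-startup.com",
          "https://washingtonpost/article/latest-movies-in-theater.com",
          "www.newspaper/mynews/what-to-cook-in-summer.com",
          "https://website/2-millions-raised-by-this-startup.com")

guess <- pair_by_distance(titles, urls)
```

On this sample every normalized title is identical to one normalized URL slug, so the result should agree with the guess column above. The possible advantage of a character-level distance over the word-overlap score is that small typos inside a word still count as near matches. The 0.25 cut-off is an arbitrary starting point and would need tuning on real data.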