Question

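I would like to analyze a field of up to 100 characters and estimate a similarity percentage between responses. For example, for the same question, "What is your opinion of smartphones?":

Person A: "Best way to waste money"

Person B: "Amazing stuff. lets you stay connected all the time"

Person C: "Instrument to waste money and time"

Of these, A and C sound similar just by matching individual words. I would like to start with something like this in R, and then extend it to match combinations of words such as "Best", "Best way", "Best way waste", and so on. I am new to text analysis and R, and I do not know the proper names of these methods well enough to search for them effectively.

Please guide me with your inputs and references. Thanks in advance.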

Answer 1

Here is a potential solution for computing the percentage similarity by hand.

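str.a <- "Best way to waste money"
str.b <- "Amazing stuff. lets you stay connected all the time"
str.c <- "Instrument to waste money and time"

# Strip information that presumably isn't important (capital letters and
# punctuation), then split the string into separate words.
# Named tokenize() rather than format() so it does not mask base::format().
tokenize <- function(string1) {
  lower <- tolower(string1)
  no.punct <- gsub("[[:punct:]]", "", lower)
  # Split on runs of whitespace so repeated spaces do not yield empty words
  strsplit(trimws(no.punct), split = "[[:space:]]+")
}

words.a <- tokenize(str.a)
words.b <- tokenize(str.b)
words.c <- tokenize(str.c)

# How similar is string 1 to string 2, as the share of string 1's unique
# words that also occur in string 2.
# NOTE: the order is important, i.e. sim.per(words.b, words.c) is different
# from sim.per(words.c, words.b).
sim.per <- function(str1, str2) {
  words1 <- unique(str1[[1]])
  words2 <- unique(str2[[1]])
  sim <- length(intersect(words1, words2))  # number of unique words in common
  sim / length(words1)
}

# test
sim.per(words.b, words.c)  # 1 of B's 9 words ("time") appears in C
sim.per(words.a, words.c)  # 3 of A's 5 words ("to", "waste", "money") appear in C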
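
Because the score above is divided by the word count of the first string, it changes when the arguments are swapped. If a symmetric score is preferred, one common option is the Jaccard index: the number of unique words the two strings share, divided by the number of unique words across both. The following is a minimal sketch, not part of the original answer; the helper names `words_of` and `jaccard` are hypothetical, and the three responses are the ones from the question.

```r
# Hypothetical helpers for illustration; not part of the original answer.

# Lowercase, strip punctuation, and split into unique words.
words_of <- function(text) {
  cleaned <- gsub("[[:punct:]]", "", tolower(text))
  unique(strsplit(trimws(cleaned), "[[:space:]]+")[[1]])
}

# Jaccard index: size of the intersection over size of the union.
jaccard <- function(text1, text2) {
  w1 <- words_of(text1)
  w2 <- words_of(text2)
  length(intersect(w1, w2)) / length(union(w1, w2))
}

responses <- c(
  A = "Best way to waste money",
  B = "Amazing stuff. lets you stay connected all the time",
  C = "Instrument to waste money and time"
)

# Pairwise similarity matrix for all responses.
sim_matrix <- outer(
  seq_along(responses), seq_along(responses),
  Vectorize(function(i, j) jaccard(responses[[i]], responses[[j]]))
)
dimnames(sim_matrix) <- list(names(responses), names(responses))

# Show the scores as percentages.
round(100 * sim_matrix, 1)
```

With this measure A and C share 3 of 8 distinct words, so they score highest among the off-diagonal pairs, which matches the intuition in the question. The same `outer` pattern should work for a whole column of responses, although the number of comparisons grows quadratically with the number of rows.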

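I hope that helps! To search for combinations of words, you will need to do some more magic. I would try editing the question to show exactly what you are looking for; you might have more luck with answers!

As for references, check out Gaston Sanchez's "Handling and Processing Strings in R", which is great.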
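
For word combinations such as "best", "best way", "best way to", one possible approach is to compare word n-grams (runs of n consecutive words) instead of single words. The following is a minimal base-R sketch under that assumption; the helper names `tokens_of`, `ngrams`, and `ngram_sim` are hypothetical, and the scoring (share of the first string's n-grams found in the second) is just one reasonable choice. Note that it only matches consecutive words, so a combination like "Best way waste" would additionally require dropping filler words such as "to" first.

```r
# Hypothetical helpers for illustration; not part of the original answer.

# Split a string into lowercase words with punctuation removed.
tokens_of <- function(text) {
  cleaned <- gsub("[[:punct:]]", "", tolower(text))
  strsplit(trimws(cleaned), "[[:space:]]+")[[1]]
}

# Build all runs of n consecutive words, e.g. n = 2 gives "best way", "way to".
ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Share of text1's unique n-grams that also occur in text2 (order matters).
ngram_sim <- function(text1, text2, n = 2) {
  g1 <- unique(ngrams(tokens_of(text1), n))
  g2 <- unique(ngrams(tokens_of(text2), n))
  if (length(g1) == 0) return(NA_real_)
  length(intersect(g1, g2)) / length(g1)
}

text_a <- "Best way to waste money"
text_c <- "Instrument to waste money and time"

ngrams(tokens_of(text_a), 2)      # the four two-word runs in response A
ngram_sim(text_a, text_c, n = 1)  # single words
ngram_sim(text_a, text_c, n = 2)  # two-word combinations
ngram_sim(text_a, text_c, n = 3)  # three-word combinations
```

As n grows, fewer combinations match, so the score for A against C drops from 3 of 5 single words to 2 of 4 two-word runs to 1 of 3 three-word runs. Searching for "n-gram" and "bag of words" should turn up the standard terminology and packages for this kind of analysis.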
