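Function for the minimum distance between strings

Date: 2016-07-30 09:34:02

Tags: r

Completely edited, many thanks to shayaa for the suggestions!

The sentences in a matrix (read in from a csv) should be checked for words that are stored in lists (read in from txt).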

library(stringr)  # provides str_detect() and str_extract() used below

sentences_list <- matrix(c(
    "this screen is great", 
    "this camera is not bad", 
    "everything good but the camera is awesome",
    "everything bad but the camera is awesome",
    "battery is ok but the camera is awesome"), ncol = 1)

word_list_one <-list("screen", "camera", "battery")
word_list_two <-list("good", "great", "awesome")
word_list_three <-list("bad", "awful", "poor")
word_list_four <-list("not", "don't", "neither")

one <- apply(sentences_list, 2, function(x) {
  str_detect(x, paste(word_list_one, sep = '|', collapse = '|'))
})

two <- apply(sentences_list, 2, function(x) {
  str_detect(x, paste(word_list_two, sep = '|', collapse = '|'))
})

three <- apply(sentences_list, 2, function(x) {
  str_detect(x, paste(word_list_three, sep = '|', collapse = '|'))
})

four <- apply(sentences_list, 2, function(x) {
  str_detect(x, paste(word_list_four, sep = '|', collapse = '|'))
})

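The following code can be used to see which words were matched. (The results are stored rather than displayed directly, because the number of results is counted in some way afterwards.)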

row=5

print(sentences_list[row])
c(str_extract(sentences_list[row], paste(word_list_one, sep = '|', collapse = '|')))
c(str_extract(sentences_list[row], paste(word_list_two, sep = '|', collapse = '|')))
c(str_extract(sentences_list[row], paste(word_list_three, sep = '|', collapse = '|')))
c(str_extract(sentences_list[row], paste(word_list_four, sep = '|', collapse = '|')))

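For row=1 and row=2 everything works fine, but not for the following cases. This is because only the first match of word_list_x in the sentence is returned. What I would prefer the code to do is return the word of word_list_x that is closest to the word found from another word_list_.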

So for row=3 of sentences_list the result is word_list_two = "good", because it is found first. The result should be word_list_two = "awesome", because in the sentence it is closer to the result of word_list_one = "camera".

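For row=4 of sentences_list the results are word_list_three = "bad" and word_list_two = "awesome". Since the result of word_list_two = "awesome" is closer to the result of word_list_one = "camera", only the result of word_list_two should be returned, and word_list_three = " " should be left blank.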

For row=5 of sentences_list the result is word_list_one = "battery", because it is found first. The result should be word_list_one = "camera", because in the sentence it is closer to the result of word_list_two = "awesome".
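
The first-match behaviour described above can be reproduced without any packages. The sketch below uses base R's regexpr() and gregexpr() as stand-ins for str_extract() and str_extract_all(); the variable names are only illustrative, and it assumes that collecting all matches is the first step towards choosing the closest one.

```r
# Row 3 of the example data and the alternation pattern for word_list_two
sentence <- "everything good but the camera is awesome"
pattern <- paste(c("good", "great", "awesome"), collapse = "|")

# regexpr() stops at the first match, like str_extract()
first_match <- regmatches(sentence, regexpr(pattern, sentence))

# gregexpr() keeps going and collects every match, like str_extract_all()
all_matches <- regmatches(sentence, gregexpr(pattern, sentence))[[1]]

print(first_match)
print(all_matches)
```

With every match available, the remaining problem is deciding which of them is nearest to the word found from the other list.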

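Obviously, as a newbie, I am completely overwhelmed by the scale of this project, and I would really appreciate any help you can offer. Thanks a lot!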

2 answers:

Answer 0: (score: 0)

Why not do it like this?

I edited your data so that it runs:

df <- c("second" , "word1", "word2", "word3", 
          "word4","first",  "word1", "word2", "third")
one <- "third"
two <- c("second", "third")

Match each vector against the data:

match1 <-match(one, df)
match2 <- match(two, df)
match3 <- match("first",df)

Find the position in the match vector that is closest to the word you are looking for, in this case the word "first":

closest <- which.min(abs(match2 - match3))

Now check your answers:

df[match1]
[1] "third"

df[match2[closest]]
[1] "third"

Edit in response to your edit:

I would do it like this:

library(stringr)
sentences_list <- list("this screen is great", 
  "this camera is not bad", 
  "everything good but the camera is awesome",
  "everything bad but the camera is awesome",
  "battery is ok but the camera is awesome")

word_list_one <- c("screen", "camera", "battery")
word_list_two <- c("good", "great", "awesome")
word_list_three <- c("bad", "awful", "poor")
word_list_four <- c("not", "don't", "neither")

l <- lapply(sentences_list, str_match_all, word_list_one)

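Here str_match_all returns a list of 5 lists, each containing three elements. The first list in l shows which entries of the first word list matched in the first sentence, along with the word that matched.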

If you keep the sentences in your original matrix, this is the same as using:

apply(sentences_list,1, str_match_all, word_list_one)

You should be able to complete the example using the original answer I provided.
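
As a rough sketch of how the position-matching idea from the original answer might be applied to one of the sentences: the code below splits row 3 into words and uses match() and which.min() as before. The names anchor_pos, candidate_pos and nearest_word are made up for this illustration, and it assumes the words are separated by single spaces.

```r
# Split the sentence into words so that positions can be compared
words <- strsplit("everything good but the camera is awesome", " ")[[1]]
word_list_two <- c("good", "great", "awesome")

# Position of the word to measure from, and of each candidate (NA if absent)
anchor_pos <- match("camera", words)
candidate_pos <- match(word_list_two, words)

# which.min() ignores the NA entries and returns the index of the smallest gap
closest <- which.min(abs(candidate_pos - anchor_pos))
nearest_word <- word_list_two[closest]

print(nearest_word)
```

Note that match() only reports the first position of each word, so a sentence that repeats a word would need extra handling.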

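Answer 1: (score: 0)

OK, this is what I came up with. My approach is that the result is a data.frame whose first column contains a word from the first list, and whose other columns, "two", "three" and "four", contain the word from each list that is closest to the word in the first column. First, two functions to compute the minimum distance: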

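# Return the word from wordList closest to `word` in `text`, plus its distance.
# A blank word (" ") with distance 1000 means nothing from wordList was found.
getMinimumDistanceWord <- function(text, word, wordList){
  closest <- " "
  minDist <- 1000
  for (w in wordList){
    d <- distanceBetweenWords(text, word, w)
    # A distance of 0 means "not found" (or the same word), so skip it
    if (d != 0 && d < minDist){
      closest <- w
      minDist <- d
    }
  }
  return (list(closest, minDist))
}


# Distance, in words, between word1 and word2 in text (0 if either is missing)
distanceBetweenWords <- function(text, word1, word2){
  x <- strsplit(text, " ")[[1]]
  pos1 <- grep(word1, x)
  pos2 <- grep(word2, x)
  if (length(pos1) == 0 || length(pos2) == 0) return (0)
  # Take the smallest gap so that repeated words still yield a single number
  return (min(abs(outer(pos1, pos2, "-"))))
}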
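
To make the distance measure concrete, here is a small self-contained variant. It is a sketch under different assumptions from the functions above: token_distance is a hypothetical name, it compares whole words exactly instead of using grep(), and it returns NA rather than 0 when a word is missing.

```r
# Hypothetical helper: number of word positions between two words in a sentence
token_distance <- function(text, word1, word2) {
  tokens <- strsplit(text, " ", fixed = TRUE)[[1]]
  pos1 <- which(tokens == word1)
  pos2 <- which(tokens == word2)
  # NA signals that at least one of the words does not occur
  if (length(pos1) == 0 || length(pos2) == 0) return(NA_integer_)
  # Smallest gap over all occurrences of both words
  min(abs(outer(pos1, pos2, "-")))
}

sentence <- "battery is ok but the camera is awesome"
d_battery <- token_distance(sentence, "battery", "awesome")
d_camera <- token_distance(sentence, "camera", "awesome")
d_missing <- token_distance(sentence, "camera", "good")

print(c(d_battery, d_camera, d_missing))
```

Using NA for "not found" keeps a genuine distance of 0 (a word compared with itself) distinguishable from a missing word, at the cost of having to check for NA before comparing distances.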

Now, iterate over the list of sentences and compute the minimum distances:

res <- data.frame(one = character(), two = character(), three = character(), four = character(), stringsAsFactors=FALSE)
i <- 1
for(elem in sentences_list){
  base.word.list <- unlist(str_extract_all(elem, paste(word_list_one, sep = '|', collapse = '|')))
  res[i, 1] <- base.word.list[1]
  res[i, 2] <- getMinimumDistanceWord(elem, base.word.list[1], word_list_two)[1]
  res[i, 3] <- getMinimumDistanceWord(elem, base.word.list[1], word_list_three)[1]
  res[i, 4] <- getMinimumDistanceWord(elem, base.word.list[1], word_list_four)[1]
  if (length(base.word.list) != 1){
    currentDistance2 <- as.numeric(unlist(getMinimumDistanceWord(elem, base.word.list[1], word_list_two))[2])
    currentDistance3 <- as.numeric(unlist(getMinimumDistanceWord(elem, base.word.list[1], word_list_three))[2])
    currentDistance4 <- as.numeric(unlist(getMinimumDistanceWord(elem, base.word.list[1], word_list_four))[2])
    for(currentWord in base.word.list){
      if (getMinimumDistanceWord(elem, currentWord, word_list_two)[2] < as.numeric(currentDistance2)){
        currentDistance2 <- getMinimumDistanceWord(elem, currentWord, word_list_two)[2]
        res[i, 1] <- currentWord
        res[i, 2] <- getMinimumDistanceWord(elem, currentWord, word_list_two)[1]
      }
      if (getMinimumDistanceWord(elem, currentWord, word_list_three)[2] < as.numeric(currentDistance3)){
        currentDistance3 <- getMinimumDistanceWord(elem, currentWord, word_list_three)[2]
        res[i, 1] <- currentWord
        res[i, 3] <- getMinimumDistanceWord(elem, currentWord, word_list_three)[1]
      }
      if (getMinimumDistanceWord(elem, currentWord, word_list_four)[2] < as.numeric(currentDistance4)){
        currentDistance4 <- getMinimumDistanceWord(elem, currentWord, word_list_four)[2]
        res[i, 1] <- currentWord
        res[i, 4] <- getMinimumDistanceWord(elem, currentWord, word_list_four)[1]
      }
    }
  }
  i <- i+1
}

The resulting data.frame will be:

     one     two three four
1 screen   great           
2 camera           bad  not
3 camera awesome           
4 camera awesome   bad     
5 camera awesome 

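For example, the first row says that the nearest word to "screen" (from list one) is "great" (from list two), and that there are no nearest words in lists "three" and "four". Likewise, the fifth row says that the nearest word to "camera" (in the fifth sentence) is "awesome". The second row says that, in the second sentence, there is a nearest word to "camera" in the third list ("bad") and another nearest word in the fourth list ("not").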
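
The choice made in the fifth row, where "camera" wins over "battery" because it sits closer to "awesome", could also be sketched in a more compact form. The helper below is hypothetical and is not the code used to produce the table: it assumes space-separated words and exact word matches, and it looks at one pair of word lists at a time.

```r
# Hypothetical helper: the anchor/candidate pair with the smallest gap
closest_pair <- function(sentence, anchors, candidates) {
  tokens <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  anchor_pos <- which(tokens %in% anchors)
  candidate_pos <- which(tokens %in% candidates)
  # Nothing to pair up if either list has no word in the sentence
  if (length(anchor_pos) == 0 || length(candidate_pos) == 0) return(NULL)
  # Rows are anchor positions, columns are candidate positions
  gaps <- abs(outer(anchor_pos, candidate_pos, "-"))
  best <- arrayInd(which.min(gaps), dim(gaps))
  list(anchor = tokens[anchor_pos[best[1, 1]]],
       candidate = tokens[candidate_pos[best[1, 2]]],
       distance = min(gaps))
}

pair <- closest_pair("battery is ok but the camera is awesome",
                     c("screen", "camera", "battery"),
                     c("good", "great", "awesome"))
print(pair)
```

Ties are resolved in favour of the first smallest entry that which.min() finds, which may or may not be the desired rule for real data.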

I hope this helps.