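R: Finding similar sentences in a text

Asked: 2018-03-22 13:27:47

Tags: r text similarity

I have a problem for which I am struggling to find a solution or a workaround.

I have some model sentences, for example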

model_sentences = data.frame("model_id" = c("model_id_1", "model_id_2"), "model_text" = c("Company x had 3000 employees in 2016.",
                                                                                          "Google makes 300 dollar in revenue in 2018."))

and some texts

data = data.frame("id" = c("id1", "id2"), "text" = c("Company y is expected to employ 2000 employees in 2020. This is an increase of 10%. Some stupid sentences.",
                                                     "Amazon´s revenue is 400 dollar in 2020. That is twice as much as last year."))

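I would like to extract from the texts the sentences that are similar to the model sentences.

Something like this would be my desired result: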

result = data.frame("id" = c("id1", "id2"), "model_id" = c("model_id_1", "model_id_2"), "sentence_from_data" = c("Company y is expected to employ 2000 employees in 2020.", "Amazon´s revenue is 400 dollar in 2020."), "score" = c(0.5, 0.4))

Perhaps it is possible to compute some kind of 'similar_score'.

I use this function to split a text into sentences:

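library(stringi)  # provides stri_trim_both()

split_by_sentence <- function(text) {

  # Split on whitespace that follows four alphanumeric characters and a ?, ! or .
  result <- unlist(strsplit(text, "(?<=[[:alnum:]]{4}[?!.])\\s+", perl = TRUE))

  # Trim surrounding whitespace and drop empty pieces
  result <- stri_trim_both(result)
  result <- result[nchar(result) > 0]

  if (length(result) == 0)
    result <- ""

  return(result)
}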

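But I don't know how to compare each sentence with the model sentences. I would be glad of any suggestions.

1 answer:

Answer 0 (score: 0)

Take a look at the stringdist package.

Example: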

library(stringdist)
mysent = "This is a sentence"
# For each row of model_sentences, compute the Jaccard distance between model_text and mysent
apply(model_sentences, 1, function(row) {
  stringdist(row['model_text'], mysent, method="jaccard")
})

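This returns the Jaccard distance from mysent to each value of the model_text variable. The smaller the value, the more similar the two sentences are under the chosen distance measure.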
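
To get from a single distance to the table asked for in the question, one possible approach is to score every sentence of a text against every model sentence and keep the closest pair. The following is only a sketch under several assumptions that are not part of the original answer: the helper names split_sentences and best_match are hypothetical, the splitter is a simplified stand-in for split_by_sentence, the strings are lowercased first, character bigrams (q = 2) are used instead of the default q = 1, and the score is taken as 1 minus the Jaccard distance.

```r
library(stringdist)

model_sentences <- data.frame(
  model_id   = c("model_id_1", "model_id_2"),
  model_text = c("Company x had 3000 employees in 2016.",
                 "Google makes 300 dollar in revenue in 2018."),
  stringsAsFactors = FALSE
)

data <- data.frame(
  id   = c("id1", "id2"),
  text = c("Company y is expected to employ 2000 employees in 2020. This is an increase of 10%. Some stupid sentences.",
           "Amazon´s revenue is 400 dollar in 2020. That is twice as much as last year."),
  stringsAsFactors = FALSE
)

# Simplified splitter (assumption): break on whitespace after ., ! or ?
split_sentences <- function(text) {
  trimws(unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)))
}

# Hypothetical helper: return the best (sentence, model sentence) pair for one text
best_match <- function(id, text, q = 2) {
  sentences <- split_sentences(text)
  # Distance matrix: rows are sentences, columns are model sentences
  d <- stringdistmatrix(tolower(sentences), tolower(model_sentences$model_text),
                        method = "jaccard", q = q)
  best <- which(d == min(d), arr.ind = TRUE)[1, ]
  r <- best[["row"]]
  cl <- best[["col"]]
  data.frame(id = id,
             model_id = model_sentences$model_id[cl],
             sentence_from_data = sentences[r],
             score = 1 - d[r, cl],  # similarity (assumption): 1 - distance
             stringsAsFactors = FALSE)
}

result <- do.call(rbind, Map(best_match, data$id, data$text))
rownames(result) <- NULL
print(result)
```

Keeping only the single best pair per text is a design choice for this sketch; to keep several sentences per text, one could instead filter the distance matrix by a threshold. The value of q and the decision to lowercase both affect the scores, so they are worth trying out on real data.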