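Using cosine similarity to filter out similar strings in a string vector

Time: 2018-04-19 09:04:38

Tags: r

I have a vector of strings. Some strings in the vector (possibly more than two) are similar to each other in terms of the words they contain. I want to filter out the strings that have a cosine similarity of more than 30% with any other string in the vector. Of the two strings being compared, I want to keep the one with more words. In other words, I only want strings that are less than 30% similar to every other string in the original vector. My goal is to filter out similar strings and keep only the ones that are roughly distinct.

Example. The vector is: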

x <- c("Dan is a good man and very smart", "A good man is rare", "Alex can be trusted with anything", "Dan likes to share his food", "Rare are man who can be trusted", "Please share food")

The result should be (assuming a similarity threshold of 30%):

c("Dan is a good man and very smart", "Dan likes to share his food", "Rare are man who can be trusted")

The result above has not been verified.

The cosine code I am using:

CSString_vector <- c("String One", "String Two")
# build a corpus with one document per string
corp <- tm::VCorpus(tm::VectorSource(CSString_vector))
controlForMatrix <- list(removePunctuation = TRUE, wordLengths = c(1, Inf),
                         weighting = tm::weightTf)
dtm <- tm::DocumentTermMatrix(corp, control = controlForMatrix)
matrix_of_vector <- as.matrix(dtm)
# cosine similarity between the first and the second string
res <- lsa::cosine(matrix_of_vector[1, ], matrix_of_vector[2, ])
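
For readers unfamiliar with the measure, here is a minimal base R sketch of what a cosine similarity over word counts computes. The helpers `cosine_sim` and `word_counts` are hypothetical names introduced only for this illustration, and the tokenization (lowercase, split on non-letters) is a simplifying assumption that may differ from what `tm` does.

```r
# Cosine similarity of two numeric vectors:
# dot product divided by the product of the Euclidean norms.
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hypothetical helper: count how often each vocabulary word occurs in a string.
# Assumption: words are lowercase runs of letters.
word_counts <- function(s, vocab) {
  words <- strsplit(tolower(s), "[^a-z]+")[[1]]
  as.numeric(table(factor(words, levels = vocab)))
}

s1 <- "A good man is rare"
s2 <- "Rare are man who can be trusted"

# shared vocabulary of both strings
vocab <- unique(unlist(strsplit(tolower(c(s1, s2)), "[^a-z]+")))

# the strings share two words ("man", "rare") out of 5 and 7 words
sim_12 <- cosine_sim(word_counts(s1, vocab), word_counts(s2, vocab))
```

Under these assumptions the two strings share two words, so the similarity works out to 2 / sqrt(5 * 7).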

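I am working in RStudio.

1 Answer:

Answer 0: (score: 1)

So, to rephrase what you want: you want to compute the pairwise similarity of all pairs of strings. You then want to use that similarity matrix to identify distinct groups of mutually similar strings. For each of these groups, you want to drop all but the longest string and return that one. Did I get that right?

After some experimenting, this is the solution I came up with, step by step:

  • Compute the similarity matrix
  • Binarize it using a threshold
  • Identify the distinct groups (cliques) with a graph algorithm from igraph
  • Look up all strings in each clique and keep the longest one

Note: I had to adjust the threshold to 0.4 to make your example work.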

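Similarity matrix

This is largely based on the code you supplied, but I packaged it as a function and used the tidyverse to make the code, at least to my taste, more readable.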

library(tm)
library(lsa)
library(tidyverse)

get_cos_sim <- function(corpus) {
  # pre-process corpus
  doc <- corpus %>%
    VectorSource %>%
    tm::VCorpus()
  # get term frequency matrix
  tfm <- doc %>%
    DocumentTermMatrix(
      control = list(
        removePunctuation = TRUE,
        wordLengths = c(1, Inf),
        weighting = weightTf)) %>%
    as.matrix()
  # get row-wise similarity
  sim <- NULL
  for(i in seq_len(nrow(tfm))) {
    sim_i <- apply(
      X = tfm, 
      MARGIN = 1, 
      FUN = lsa::cosine, 
      tfm[i,])
    sim <- rbind(sim, sim_i)
  }
  # set identity diagonal to zero
  diag(sim) <- 0
  # label and return
  rownames(sim) <- corpus
  return(sim)
}
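
As an aside, the row-by-row loop can also be expressed with matrix algebra. The following is a minimal base R sketch on a made-up term-frequency matrix (the values are illustrative and not taken from the example strings). It assumes every row has at least one non-zero count, since an all-zero row would produce NaN.

```r
# Toy term-frequency matrix: rows are documents, columns are terms.
tfm <- rbind(
  c(1, 1, 0, 0),
  c(1, 0, 1, 0),
  c(0, 0, 0, 2)
)

# Euclidean norm of each document vector
norms <- sqrt(rowSums(tfm^2))

# All pairwise dot products, divided by the matching pairs of norms
sim <- tcrossprod(tfm) / outer(norms, norms)

# Ignore self-similarity, as in get_cos_sim()
diag(sim) <- 0
```

Here documents 1 and 2 share one of their two terms, so `sim[1, 2]` is 0.5, while document 3 shares nothing with the others.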

Now we apply this function to your example data:

# example corpus
strings <- c(
  "Dan is a good man and very smart", 
  "A good man is rare", 
  "Alex can be trusted with anything", 
  "Dan likes to share his food", 
  "Rare are man who can be trusted", 
  "Please share food")

# get pairwise similarities
sim <- get_cos_sim(strings)
# binarize (using a different threshold to make your example work)
sim <- sim > .4  

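Identify distinct groups

This turned out to be an interesting problem! I found this paper, Chalermsook & Chuzhoy: Maximum Independent Set of Rectangles, which led me to this implementation in the igraph package. Basically, we treat similar strings as connected vertices in a graph and then look for distinct groups in the graph of the whole similarity matrix.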

library(igraph)

# create graph from adjacency matrix
cliques <- sim %>% 
  tibble::as_tibble() %>%
  mutate(from = row_number()) %>% 
  gather(key = 'to', value = 'edge', -from) %>% 
  filter(edge) %>%
  graph_from_data_frame(directed = FALSE) %>%
  max_cliques()

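Find the longest string

Now we can use the list of cliques to retrieve the strings for each set of vertices and pick the longest string per clique. Caveat: strings in the corpus that have no similar strings are missing from the graph, so I am adding them back manually. There may be a function in the igraph package that handles this better; I would be interested if anyone finds one.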
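
The code for this step calls a helper named `find_longest`, whose definition does not appear in this answer. A minimal sketch of what it might look like follows, assuming it receives a vector of indices plus the original strings and that "longest" means the most characters (counting words instead would be an equally plausible reading of the question).

```r
# Assumed helper: return the longest string among strings[idx].
# "Longest" is measured in characters here; this is an assumption.
find_longest <- function(idx, strings) {
  candidates <- strings[idx]
  candidates[which.max(nchar(candidates))]
}

# Usage: pick the longer of the first two example strings
example <- c("Dan is a good man and very smart", "A good man is rare")
longest <- find_longest(c(1, 2), example)
```

For a single index, as with the unconnected strings, the function simply returns that string.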

# get the string indices per vertex clique first
string_cliques_index <- cliques %>% 
  unlist %>%
  names %>%
  as.numeric
# find the indices that are distinct but not in a clique
# (i.e. unconnected vertices)
string_uniques_index <- colnames(sim)[!colnames(sim) %in% string_cliques_index] %>%
  as.numeric
# get a list with all indices
all_distinct <- cliques %>% 
  lapply(names) %>% 
  lapply(as.numeric) %>%
  c(string_uniques_index)
# get a list of distinct strings; find_longest() is a helper that returns
# the longest string among the given indices
lapply(all_distinct, find_longest, strings)  
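
On the caveat about unconnected strings: one possible way to avoid adding them back by hand is to build the graph directly from the adjacency matrix, so that every string becomes a vertex even when it has no edges. The sketch below uses igraph's `graph_from_adjacency_matrix()` and `components()` on a made-up 4 x 4 matrix. Note that this swaps in connected components for maximal cliques, which is a different grouping: the two agree when every group is a pair or fully connected, but a chain of similar strings would end up in a single component.

```r
library(igraph)

# Toy binarized similarity matrix: only strings 1 and 2 are similar
adj <- matrix(FALSE, nrow = 4, ncol = 4)
adj[1, 2] <- adj[2, 1] <- TRUE

# Multiply by 1 to get a numeric 0/1 matrix; isolated rows remain vertices
g <- graph_from_adjacency_matrix(adj * 1, mode = "undirected")

# Component id per vertex, then the vertex indices of each group
membership <- components(g)$membership
groups <- split(seq_along(membership), membership)
```

Each element of `groups` is a vector of string indices, so it could be passed to a helper such as `find_longest` in the same way as the clique list.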

Test case:

Let's test it with a longer vector of different strings:

strings <- c(
  "Dan is a good man and very smart", 
  "A good man is rare", 
  "Alex can be trusted with anything", 
  "Dan likes to share his food", 
  "Rare are man who can be trusted", 
  "Please share food",
  "NASA is a government organisation",
  "The FBI organisation is part of the government of USA",
  "Hurricanes are a tragedy",
  "Mangoes are very tasty to eat ",
  "I like to eat tasty food",
  "The thief was caught by the FBI")

I get this binarized similarity matrix:

Dan is a good man and very smart                      FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
A good man is rare                                     TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Alex can be trusted with anything                     FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Dan likes to share his food                           FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Rare are man who can be trusted                       FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Please share food                                     FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
NASA is a government organisation                     FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
The FBI organisation is part of the government of USA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
Hurricanes are a tragedy                              FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Mangoes are very tasty to eat                         FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
I like to eat tasty food                              FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
The thief was caught by the FBI                       FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

Based on these similarities, the expected result would be:

# included
Dan is a good man and very smart
Alex can be trusted with anything
Dan likes to share his food
NASA is a government organisation
The FBI organisation is part of the government of USA
Hurricanes are a tragedy
Mangoes are very tasty to eat

# omitted
A good man is rare
Rare are man who can be trusted
Please share food
I like to eat tasty food
The thief was caught by the FBI

The actual output has the right elements, but not in the original order. You can reorder it using the original string vector.

[[1]]
[1] "The FBI organisation is part of the government of USA"

[[2]]
[1] "Dan is a good man and very smart"

[[3]]
[1] "Alex can be trusted with anything"

[[4]]
[1] "Dan likes to share his food"

[[5]]
[1] "Mangoes are very tasty to eat "

[[6]]
[1] "NASA is a government organisation"

[[7]]
[1] "Hurricanes are a tragedy"
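
Restoring the original order can be sketched in base R by indexing the original vector with a membership test. The vectors below are short placeholders rather than the real output, and the approach assumes the strings in the original vector are unique.

```r
# Placeholder input and an out-of-order result list
strings <- c("b string", "a string", "c string")
result <- list("c string", "b string")

# Keep the original strings that appear in the result, in original order
kept <- strings[strings %in% unlist(result)]
```

Because the filter runs over `strings`, the kept elements come back in the order of the original vector regardless of how the list was ordered.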

That's all! I hope this is what you were looking for, and that it may be useful to others.