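I have a vector of strings. Some of the strings (possibly more than two) are similar to each other in terms of the words they contain. I want to filter out the strings that have a cosine similarity of more than 30% with any other string in the vector. Of two strings being compared, I want to keep the one with more words. In other words, I only want strings that are less than 30% similar to every other string of the original vector. My goal is to filter out similar strings and keep only those that are roughly distinct.
Example. The vector is: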
x <- c("Dan is a good man and very smart", "A good man is rare", "Alex can be trusted with anything", "Dan likes to share his food", "Rare are man who can be trusted", "Please share food")
The result should be (assuming a similarity below 30%):
c("Dan is a good man and very smart", "Dan likes to share his food", "Rare are man who can be trusted")
The result above has not been verified.
The cosine code I am using:
library(tm)   # VCorpus, VectorSource, DocumentTermMatrix, weightTf
library(lsa)  # cosine
CSString_vector <- c("String One","String Two")
corp <- tm::VCorpus(VectorSource(CSString_vector))
controlForMatrix <- list(removePunctuation = TRUE, wordLengths = c(1, Inf),
                         weighting = weightTf)
dtm <- DocumentTermMatrix(corp, control = controlForMatrix)
matrix_of_vector <- as.matrix(dtm)
res <- lsa::cosine(matrix_of_vector[1,], matrix_of_vector[2,])
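I am working in RStudio.
Answer 0 (score: 1)
So, to rephrase what you want: you want to calculate pairwise similarities for all pairs of strings. You then want to use that similarity matrix to identify groups of strings that are similar to each other but distinct from the rest. For each of these groups, you want to drop all but the longest string and return that one. Did I get that right?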
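After some experimenting, here is the solution I came up with, step by step. It relies on the igraph package.
Note: I had to adjust the threshold to 0.4 to make your example work.
The first step is largely based on the code you provided, but I wrapped it in a function and used the tidyverse to make the code, at least to my taste, more readable.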
library(tm)
library(lsa)
library(tidyverse)
get_cos_sim <- function(corpus) {
  # pre-process corpus
  doc <- corpus %>%
    VectorSource %>%
    tm::VCorpus()
  # get term frequency matrix
  tfm <- doc %>%
    DocumentTermMatrix(
      control = list(
        removePunctuation = TRUE,
        wordLengths = c(1, Inf),
        weighting = weightTf)) %>%
    as.matrix()
  # get row-wise similarity
  sim <- NULL
  for (i in seq_len(nrow(tfm))) {
    sim_i <- apply(
      X = tfm,
      MARGIN = 1,
      FUN = lsa::cosine,
      tfm[i, ])
    sim <- rbind(sim, sim_i)
  }
  # set identity diagonal to zero
  diag(sim) <- 0
  # label rows with the original strings and return
  rownames(sim) <- corpus
  return(sim)
}
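The row-by-row loop in get_cos_sim can also be expressed as a single matrix product. The following is a minimal base-R sketch, not part of the original answer; the name cosine_matrix and the small hand-built term-frequency matrix are assumptions made for illustration.

```r
# Hypothetical helper: cosine similarity between all rows of a
# term-frequency matrix, computed with one matrix product.
cosine_matrix <- function(tfm) {
  dot   <- tcrossprod(tfm)             # dot products between all row pairs
  norms <- sqrt(diag(dot))             # Euclidean length of each row
  sim   <- dot / outer(norms, norms)   # normalise to cosine similarity
  diag(sim) <- 0                       # ignore self-similarity, as get_cos_sim does
  sim
}

# Toy term-frequency matrix: 3 documents, 4 terms (made-up counts)
toy_tfm <- rbind(
  c(1, 1, 0, 0),
  c(1, 1, 1, 0),
  c(0, 0, 0, 1))

toy_sim <- cosine_matrix(toy_tfm)
```

Note that a document with no terms at all would have length zero and produce NaN entries, so such rows would need to be handled separately.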
Now we apply this function to your example data:
# example corpus
strings <- c(
  "Dan is a good man and very smart",
  "A good man is rare",
  "Alex can be trusted with anything",
  "Dan likes to share his food",
  "Rare are man who can be trusted",
  "Please share food")
# get pairwise similarities
sim <- get_cos_sim(strings)
# binarize (using a different threshold to make your example work)
sim <- sim > .4
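To see how much the threshold matters, it can help to compute one pair by hand. The sketch below uses only base R and a simplified tokenizer (lowercase, split on anything that is not a letter); that tokenizer is an assumption and only approximates what tm does, and cosine_words is a hypothetical helper name.

```r
# Hypothetical helper: cosine similarity of two strings based on word counts
cosine_words <- function(a, b) {
  tokenize <- function(s) {
    words <- strsplit(tolower(s), "[^a-z]+")[[1]]
    words[nzchar(words)]               # drop empty tokens
  }
  wa <- tokenize(a)
  wb <- tokenize(b)
  vocab <- union(wa, wb)
  # term-frequency vectors over the combined vocabulary
  ta <- as.numeric(table(factor(wa, levels = vocab)))
  tb <- as.numeric(table(factor(wb, levels = vocab)))
  sum(ta * tb) / sqrt(sum(ta^2) * sum(tb^2))
}

s <- cosine_words("A good man is rare", "Rare are man who can be trusted")
```

These two strings share two words ("man" and "rare") out of 5 and 7 distinct words, so under this tokenizer the value is 2 / sqrt(35), roughly 0.338. That lies between 0.3 and 0.4, so this pair counts as similar at a threshold of 0.3 but not at 0.4.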
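This turned out to be an interesting problem! I found this paper, Chalermsook & Chuzhoy: Maximum Independent Set of Rectangles, which led me to this implementation in the igraph package. Basically, we treat similar strings as connected vertices in a graph and then look for distinct groups in the graph of the whole similarity matrix.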
library(igraph)
# create graph from adjacency matrix
cliques <- sim %>%
  tibble::as_tibble() %>%                       # replaces the deprecated as_data_frame()
  mutate(from = row_number()) %>%
  gather(key = 'to', value = 'edge', -from) %>%
  filter(edge) %>%                              # keep only pairs above the threshold
  graph_from_data_frame(directed = FALSE) %>%
  max_cliques()
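Now we can use the list of cliques to retrieve the strings for each of the vertices and pick the longest string per clique. Caveat: strings that have no similar strings in the corpus are missing from the graph, so I am adding them back manually. There may be a function in the igraph package that handles this better; I would be interested if anyone finds something.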
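The next snippet calls find_longest, which is not defined anywhere in this post. A minimal sketch of what such a helper might look like, assuming "longest" means most characters rather than most words (an assumption, although it matches the final output, which keeps "Alex can be trusted with anything" over "Rare are man who can be trusted"):

```r
# Assumed helper: given a vector of indices into `strings`,
# return the string with the most characters among them.
find_longest <- function(i, strings) {
  candidates <- strings[i]
  candidates[which.max(nchar(candidates))]
}

# Usage on two strings from the example
example_strings <- c(
  "Alex can be trusted with anything",
  "Rare are man who can be trusted")
longest <- find_longest(c(1, 2), example_strings)
```

To keep the string with the most words instead, as the question asks, nchar could be swapped for a word count.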
# get the string indices per vertex clique first
string_cliques_index <- cliques %>%
  unlist %>%
  names %>%
  as.numeric
# find the indices that are distinct but not in a clique
# (i.e. unconnected vertices)
string_uniques_index <- colnames(sim)[!colnames(sim) %in% string_cliques_index] %>%
  as.numeric
# get a list with all indices
all_distinct <- cliques %>%
  lapply(names) %>%
  lapply(as.numeric) %>%
  c(string_uniques_index)
# get a list of distinct strings
lapply(all_distinct, find_longest, strings)
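On the caveat about unconnected strings: one possible way around the manual step is to build the graph directly from the adjacency matrix instead of from an edge list. As far as I know, graph_from_adjacency_matrix keeps vertices without edges and max_cliques reports each of them as a clique of size one, but treat that as an assumption to verify on your igraph version. The toy matrix below is made up for illustration.

```r
library(igraph)

# Toy binarized similarity matrix: documents 1 and 2 are similar,
# document 3 is similar to nothing (made-up data)
adj <- matrix(0, nrow = 3, ncol = 3)
adj[1, 2] <- adj[2, 1] <- 1

g <- graph_from_adjacency_matrix(adj, mode = "undirected")

# each list element holds the vertex ids of one maximal clique
clique_ids <- lapply(max_cliques(g), as_ids)
```

If this behaves as assumed, clique_ids already contains the isolated document, so no separate lookup of unconnected indices is needed.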
Let's test this with a longer vector of different strings:
strings <- c(
  "Dan is a good man and very smart",
  "A good man is rare",
  "Alex can be trusted with anything",
  "Dan likes to share his food",
  "Rare are man who can be trusted",
  "Please share food",
  "NASA is a government organisation",
  "The FBI organisation is part of the government of USA",
  "Hurricanes are a tragedy",
  "Mangoes are very tasty to eat ",
  "I like to eat tasty food",
  "The thief was caught by the FBI")
I get this binarized similarity matrix:
Dan is a good man and very smart                       FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
A good man is rare                                      TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Alex can be trusted with anything                      FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Dan likes to share his food                            FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Rare are man who can be trusted                        FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Please share food                                      FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
NASA is a government organisation                      FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
The FBI organisation is part of the government of USA  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
Hurricanes are a tragedy                               FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Mangoes are very tasty to eat                          FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
I like to eat tasty food                               FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
The thief was caught by the FBI                        FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
Based on these similarities, the expected result would be:
# included
Dan is a good man and very smart
Alex can be trusted with anything
Dan likes to share his food
NASA is a government organisation
The FBI organisation is part of the government of USA
Hurricanes are a tragedy
Mangoes are very tasty to eat
# omitted
A good man is rare
Rare are man who can be trusted
Please share food
I like to eat tasty food
The thief was caught by the FBI
The actual output has the correct elements, but not in the original order. You can restore the order using the original string vector.
[[1]]
[1] "The FBI organisation is part of the government of USA"
[[2]]
[1] "Dan is a good man and very smart"
[[3]]
[1] "Alex can be trusted with anything"
[[4]]
[1] "Dan likes to share his food"
[[5]]
[1] "Mangoes are very tasty to eat "
[[6]]
[1] "NASA is a government organisation"
[[7]]
[1] "Hurricanes are a tragedy"
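Restoring the original order can be done by filtering the original vector against the result. A minimal base-R sketch with made-up toy data; the variable names are assumptions:

```r
# original vector (toy data) and a filtered result in shuffled order
orig_strings <- c("a b c", "d e", "f g h i")
kept         <- list("f g h i", "a b c")

# keep the strings that survived, in the order of the original vector
ordered <- orig_strings[orig_strings %in% unlist(kept)]
```

Because the filter walks through orig_strings, the surviving strings come out in their original positions regardless of the order in which the cliques were returned.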
That's all! I hope this is what you were looking for and that it is useful to others.