I have many text sentences to compare, but here is an example using Little Red Riding Hood:
text1 <- "Once upon a time"
text2 <- "there was a dear little girl"
text3 <- "who was loved by everyone who looked at her"
I want to create a matrix that counts the words the sentences have in common, like this:
text1_split <- unlist(strsplit(text1, " "))
text2_split <- unlist(strsplit(text2, " "))
text3_split <- unlist(strsplit(text3, " "))
length(intersect(text1_split, text2_split))
length(intersect(text2_split, text3_split))
texts <- c("text1","text2","text3")
data <- data.frame(texts)
data[, texts] <- NA
rownames(data) <- texts
data <- data[,-1]
data[1,1] <- length(intersect(text1_split, text1_split))
data[1,2] <- length(intersect(text1_split, text2_split))
data[1,3] <- length(intersect(text1_split, text3_split))
My matrix so far looks like this:
      text1 text2 text3
text1     4     1     0
text2    NA    NA    NA
text3    NA    NA    NA
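Spelled out, the pattern I am repeating is just a double loop over the split vectors (a sketch reusing the objects defined above):

splits <- list(text1_split, text2_split, text3_split)
# fill every cell of 'data' with the pairwise common-word count
for (i in seq_along(splits)) {
    for (j in seq_along(splits)) {
        data[i, j] <- length(intersect(splits[[i]], splits[[j]]))
    }
}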
Is there an efficient way to complete the matrix? I have more than 100 sentences to compare. This post is similar, but not identical: Count common words in two strings in R
Answer 0 (score: 1)
Try this:
CommonWordsMatrixOld <- function(vList) {
    v <- lapply(vList, tolower)
    do.call(rbind, lapply(v, function(x) {
        # split the current sentence once per row
        xSplit <- strsplit(x, " ")[[1]]
        # count the words shared with every sentence (including itself)
        do.call(c, lapply(v, function(y) length(intersect(xSplit, strsplit(y, " ")[[1]]))))
    }))
}
myText <- list(text1, text2, text3)
Calling it gives us:
CommonWordsMatrixOld(myText)
     [,1] [,2] [,3]
[1,]    4    1    0
[2,]    1    6    1
[3,]    0    1    8
It is quite fast for data of the size the OP is asking about. The test data was obtained here:
testWords <- read.csv("4000-most-common-english-words-csv.csv", stringsAsFactors = FALSE)
set.seed(1111)
myTestText <- lapply(1:100, function(x) {
    paste(testWords[sample(1000:1020, sample(30, 1), replace = TRUE), ], collapse = " ")
})
myTestText[[15]]
[1] "access restaurant video opinion video eventually fresh eventually
reform credit publish judge Senate publish fresh restaurant publish
version Senate critical release recall relation version"
system.time(test1 <- CommonWordsMatrixOld(myTestText))
   user  system elapsed
  0.625   0.009   0.646
Here is the output:
test1[1:10,1:10]
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    9    3    5    1    3    4    4    2    2     1
 [2,]    3    5    3    1    1    3    3    0    0     1
 [3,]    5    3   12    0    3    8    4    3    2     1
 [4,]    1    1    0    1    0    0    1    0    0     0
 [5,]    3    1    3    0    4    2    1    1    1     0
 [6,]    4    3    8    0    2   13    7    4    1     1
 [7,]    4    3    4    1    1    7   10    4    1     1
 [8,]    2    0    3    0    1    4    4    7    3     0
 [9,]    2    0    2    0    1    1    1    3    4     0
[10,]    1    1    1    0    0    1    1    0    0     2
Update
Here is a much faster algorithm that drops many unnecessary operations and takes advantage of lower.tri, while remaining very general.
CommonWordsMatrixNew <- function(vList) {
    # lowercase and split each sentence exactly once
    v <- lapply(vList, function(x) tolower(strsplit(x, " ")[[1]]))
    s <- length(v)
    # compute only the upper triangle (including the diagonal)
    m <- do.call(rbind, lapply(1L:s, function(x) {
        c(rep(0L, (x - 1L)),
          do.call(c, lapply(x:s, function(y) length(intersect(v[[x]], v[[y]])))))
    }))
    # mirror the upper triangle onto the lower triangle
    m[lower.tri(m)] <- t(m)[lower.tri(m)]
    m
}
To give you an idea of the performance gains, here are some benchmarks. (It should be noted that CommonWordsMatrixPach is the function from the other answer below, called on the pre-split list PreSplit1, so it skips the splitting step and is not an entirely fair comparison.) The new algorithm is almost twice as fast as the pre-split solution.
microbenchmark(New = CommonWordsMatrixNew(myTestText),
               Old = CommonWordsMatrixOld(myTestText),
               Pach = CommonWordsMatrixPach(PreSplit1), times = 10)
Unit: milliseconds
 expr       min        lq      mean    median        uq      max neval
  New  78.64434  79.07127  86.10754  79.72828  81.39679 137.0695    10
  Old 321.49031 323.89835 326.61801 325.75221 328.50877 335.3306    10
 Pach 138.34742 143.00504 145.35147 145.17376 148.34699 151.5535    10
identical(CommonWordsMatrixNew(myTestText), CommonWordsMatrixOld(myTestText)) &&
    identical(CommonWordsMatrixOld(myTestText), CommonWordsMatrixPach(PreSplit1))
[1] TRUE
The new algorithm reduces the number of calls to strsplit by n^2 - n (e.g. in the example above, strsplit is called 10000 times in the original algorithm, but only 100 times in the updated version). Additionally, since the resulting matrix is symmetric, we don't need to calculate the interaction between each pair of sentences more than once, hence x = 1L:s and y = x:s inside the lapply functions. The number of computations for these loops drops from n^2 to the nth triangle number = n*(n+1)/2 (e.g. in the example above, from 10000 to 5050). After that, we rely on the power of indexing in R, which is generally much faster than building the result manually.
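If the indexing step looks opaque, here is a minimal sketch (a toy matrix, not from the data above) of how the lower.tri assignment mirrors the upper triangle:

# start with an upper-triangular matrix (filled column-wise)
m <- matrix(c(1L, 0L, 0L,
              2L, 4L, 0L,
              3L, 5L, 6L), nrow = 3)
# copy the transposed upper triangle into the empty lower triangle
m[lower.tri(m)] <- t(m)[lower.tri(m)]
m
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    2    4    5
# [3,]    3    5    6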
Answer 1 (score: 0)
I found that splitting beforehand improves the speed, so
CommonWordsMatrix <- function(vList) {
    # vList holds pre-split word vectors rather than whole sentences
    v <- lapply(vList, tolower)
    do.call(rbind, lapply(v, function(x) {
        do.call(c, lapply(v, function(y) length(intersect(x, y))))
    }))
}
is a good option (x and y are pre-split word vectors here).
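A minimal usage sketch (assuming the myTestText list from the other answer; the name PreSplit1 matches the benchmark above):

# split each sentence once up front, then pass the word vectors directly
PreSplit1 <- lapply(myTestText, function(x) strsplit(x, " ")[[1]])
resPach <- CommonWordsMatrix(PreSplit1)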