How can I count common words and store the result in a matrix?

Date: 2016-09-21 22:40:12

Tags: r matrix

I have a lot of text sentences that I want to compare, but here is an example using Little Red Riding Hood:

text1 <- "Once upon a time"
text2 <- "there was a dear little girl"
text3 <- "who was loved by everyone who looked at her"

I would like to create a matrix that counts the common words, like this:

text1_split <- unlist(strsplit(text1, " "))
text2_split <- unlist(strsplit(text2, " "))
text3_split <- unlist(strsplit(text3, " "))

length(intersect(text1_split, text2_split))
length(intersect(text2_split, text3_split))

texts <- c("text1","text2","text3")
data <- data.frame(texts)
data[, texts] <- NA
rownames(data) <- texts
data <- data[,-1]

data[1,1] <- length(intersect(text1_split, text1_split))
data[1,2] <- length(intersect(text1_split, text2_split))
data[1,3] <- length(intersect(text1_split, text3_split))

The result of my matrix would be this:

      text1 text2 text3
text1     4     1     0
text2    NA    NA    NA
text3    NA    NA    NA

Is there a way to complete the matrix efficiently? I have more than 100 sentences to compare. Here is a similar, but not identical, post: Count common words in two strings in R

2 answers:

Answer 0 (score: 1):

Try this:

CommonWordsMatrixOld <- function(vList) {
    v <- lapply(vList, tolower)               # compare case-insensitively
    do.call(rbind, lapply(v, function(x) {
             xSplit <- strsplit(x, " ")[[1]]  # split sentence x into words
             ## count the words shared between sentence x and every sentence y
             do.call(c, lapply(v, function(y) length(intersect(xSplit, strsplit(y, " ")[[1]]))))
        }))
}

myText <- list(text1, text2, text3)

Calling it gives us:

CommonWordsMatrixOld(myText)
     [,1] [,2] [,3]
[1,]    4    1    0
[2,]    1    6    1
[3,]    0    1    8

It is reasonably fast for data of the size the OP is asking about. The data was obtained here:

testWords <- read.csv("4000-most-common-english-words-csv.csv", stringsAsFactors = FALSE)

set.seed(1111)
myTestText <- lapply(1:100, function(x) {
         paste(testWords[sample(1000:1020, sample(30, 1), replace = TRUE),],collapse = " ")
    })

myTestText[[15]]
[1] "access restaurant video opinion video eventually fresh eventually
 reform credit publish judge Senate publish fresh restaurant publish
 version Senate critical release recall relation version"

system.time(test1 <- CommonWordsMatrixOld(myTestText))
 user  system elapsed 
0.625   0.009   0.646

Here is the output:

test1[1:10,1:10]
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    9    3    5    1    3    4    4    2    2     1
 [2,]    3    5    3    1    1    3    3    0    0     1
 [3,]    5    3   12    0    3    8    4    3    2     1
 [4,]    1    1    0    1    0    0    1    0    0     0
 [5,]    3    1    3    0    4    2    1    1    1     0
 [6,]    4    3    8    0    2   13    7    4    1     1
 [7,]    4    3    4    1    1    7   10    4    1     1
 [8,]    2    0    3    0    1    4    4    7    3     0
 [9,]    2    0    2    0    1    1    1    3    4     0
[10,]    1    1    1    0    0    1    1    0    0     2

Update

Here is a faster algorithm that eliminates many unnecessary operations and takes advantage of lower.tri, while remaining very general.

CommonWordsMatrixNew <- function(vList) {
    v <- lapply(vList, function(x) tolower(strsplit(x, " ")[[1]]))  # split and lowercase each sentence once
    s <- length(v)
    ## compute only the upper triangle (y >= x), padding the lower part with zeros
    m <- do.call(rbind, lapply(1L:s, function(x) {
        c(rep(0L, (x - 1L)), do.call(c, lapply(x:s, function(y) length(intersect(v[[x]], v[[y]])))))
    }))
    m[lower.tri(m)] <- t(m)[lower.tri(m)]  # mirror the upper triangle onto the lower
    m
}

To give you a feel for the performance gains, here are some benchmarks. (It should be noted that the Pach solution receives pre-split vectors, so it is not an entirely equal comparison.) The new algorithm is nearly twice as fast as the pre-split solution.

library(microbenchmark)
microbenchmark(New = CommonWordsMatrixNew(myTestText),
               Old = CommonWordsMatrixOld(myTestText),
               Pach = CommonWordsMatrixPach(PreSplit1), times = 10)
Unit: milliseconds
expr       min        lq      mean    median        uq      max neval
 New  78.64434  79.07127  86.10754  79.72828  81.39679 137.0695    10
 Old 321.49031 323.89835 326.61801 325.75221 328.50877 335.3306    10
Pach 138.34742 143.00504 145.35147 145.17376 148.34699 151.5535    10

all(sapply(list(CommonWordsMatrixOld(myTestText), CommonWordsMatrixPach(PreSplit1)),
           identical, CommonWordsMatrixNew(myTestText)))
[1] TRUE

The new algorithm reduces the number of calls to strsplit by n^2 - n (e.g. in the example above, strsplit is called 10000 times in the original algorithm and only 100 times in the updated version). Additionally, since the resulting matrix is symmetric, there is no need to compute the interaction between each pair of sentences more than once, hence the y = x:s inside the lapply over x = 1:s. The number of calculations in these loops drops from n^2 to the nth triangle number = n*(n+1)/2 (e.g. in the example above, from 10000 to 5050). After that, we rely on the power of indexing in R, which is generally much faster than building the values by hand.
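
To make the symmetry trick concrete, here is a toy illustration (not part of the original answer; the values simply reproduce the 3-sentence matrix shown earlier). Only the upper triangle is filled, column by column, and then mirrored onto the lower triangle by indexing:

m <- matrix(0L, 3, 3)
m[upper.tri(m, diag = TRUE)] <- c(4L, 1L, 6L, 0L, 1L, 8L)  # upper triangle, filled column-wise
m[lower.tri(m)] <- t(m)[lower.tri(m)]                      # copy the mirror image into the lower triangle
m
     [,1] [,2] [,3]
[1,]    4    1    0
[2,]    1    6    1
[3,]    0    1    8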

Answer 1 (score: 0):

I find that splitting beforehand improves the speed, so

CommonWordsMatrix <- function(vList) {
  v <- lapply(vList, tolower)  # vList holds pre-split word vectors
  do.call(rbind, lapply(v, function(x) {
    do.call(c, lapply(v, function(y) length(intersect(x, y))))
  }))
}

is a decent option (x and y here are pre-split vectors of words).
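
For completeness, a minimal usage sketch (the PreSplit name below is my own; it mirrors the pre-split input used as PreSplit1 in the benchmark above): split each sentence into a word vector once, then pass the list of vectors to the function.

# split each sentence once, then compare the word vectors directly
PreSplit <- lapply(list(text1, text2, text3), function(s) strsplit(s, " ")[[1]])
CommonWordsMatrix(PreSplit)
     [,1] [,2] [,3]
[1,]    4    1    0
[2,]    1    6    1
[3,]    0    1    8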