Filtering a TermDocumentMatrix with a dictionary of regular expressions

Time: 2016-06-16 15:12:34

Tags: r tm stringi

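I think this should be fairly easy. I have a dictionary of terms that is currently formatted as globs, which I convert to regular expressions. I convert them because I believe the tm package only works with regular expressions. That part is fine. But I cannot figure out how to subset a TermDocumentMatrix by passing multiple dictionary terms. Another twist is that the dictionary terms vary in length: some are one word long, some two, and some three.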
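For the glob-to-regex step, base R's glob2rx() (in the utils package) is one option. The following is a minimal sketch, assuming the globs are plain phrases; the `globs` vector is made up for illustration and is not the actual dictionary:

```r
# Hypothetical glob dictionary (an assumption for illustration)
globs <- c("also", "told reuters", "an emergency", "in world oil")

# glob2rx() anchors each pattern at both ends unless it ends in a wildcard
dict <- glob2rx(globs)
print(dict)

# A trailing "*" becomes an open-ended match
print(glob2rx("oil*"))
```

Because every resulting pattern is anchored with `^` and `$`, each one can only match a term that equals the phrase exactly.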

Here is my code so far.

#load libraries
library(tm)
library(stringi)
#Load corpus crude part of tm package
data(crude)
#make tokenizer to account for multi-word dictionaries
myTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 1:3), paste, collapse = " "),
         use.names = FALSE)
#make TermDocumentMatrix
tdm<-TermDocumentMatrix(crude, control=list(tokenizer=myTokenizer))
#Make dictionary of regular expressions
dict<-c('^also$', '^told reuters$', '^an emergency$', '^in world oil$')
#This is what I am working with
inspect(
  tdm[sapply(dict, function(x) stri_detect_regex(tdm$dimnames$Terms, pattern = x)), ]
)

1 Answer:

Answer 0 (score: 1)

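I now see that the crude dataset ships with one of the packages, which makes testing possible. It shows that removing the carets and dollar signs from the patterns finds a much larger number of terms that match the targets: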

> sum( sapply(dict, grepl, x=tdm$dimnames$Terms))
[1] 4
> dict2<-c('also', 'told reuters', 'an emergency', 'in world oil')
> sum( sapply(dict2, grepl, x=tdm$dimnames$Terms))
[1] 51

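If you use grep instead, you can see which terms matched. (The grepl result has 4 times as many elements as tdm$dimnames$Terms, one logical column per pattern.)
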
> sapply(dict2, grep, x=tdm$dimnames$Terms)
$also
 [1]  707  708  738  739  740  741  742  743  744  745  746  747  748  749  750  751  752  753
[19]  754 1485 1486 2434 2881 2882 2988 2989 3399 3400 3782 3983 5265 5995 6088 6382 6383 6893
[37] 7427 7428 7524 7525 7605

$`told reuters`
[1] 3013 7209 7210

$`an emergency`
[1]  779  780  781 2437 2642 4205

$`in world oil`
[1] 3276
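To make the difference in shape between the grepl and grep results concrete, here is a small base-R sketch. The `terms` vector is a toy stand-in for tdm$dimnames$Terms (an assumption for illustration, not data from crude):

```r
# Toy stand-in for tdm$dimnames$Terms (hypothetical values)
terms <- c("also", "he also", "told reuters", "an emergency", "oil")
patterns <- c("also", "told reuters", "an emergency", "in world oil")

# grepl: one logical column per pattern, one row per term
hits <- sapply(patterns, grepl, x = terms)
print(dim(hits))

# grep: match positions per pattern; lengths differ, so sapply returns a list
idx <- sapply(patterns, grep, x = terms)
print(idx)

# Collapse the matrix to one logical per term: TRUE if any pattern matched
keep <- rowSums(hits) > 0
print(terms[keep])
```

A single logical vector such as `keep`, with one element per term, is the shape a row index needs.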

The print method for a TDM is not especially helpful, but you can use dput to "explode" the value and see what is inside it:

> dput(tdm[ sapply(dict2, grepl, x=tdm$dimnames$Terms), ] )
structure(list(i = c(1L, 2L, 3L, 8L, 9L, 33L, 3L, 16L, 17L, 20L, 
21L, 32L, 3L, 6L, 7L, 22L, 39L, 40L, 3L, 14L, 15L, 36L, 37L, 
38L, 3L, 12L, 13L, 27L, 28L, 41L, 3L, 10L, 11L, 25L, 26L, 30L, 
3L, 4L, 5L, 23L, 24L, 31L, 3L, 4L, 5L, 23L, 24L, 31L, 3L, 18L, 
19L, 29L, 34L, 35L), j = c(6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 
7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 10L, 
10L, 10L, 10L, 10L, 10L, 14L, 14L, 14L, 14L, 14L, 14L, 16L, 16L, 
16L, 16L, 16L, 16L, 17L, 17L, 17L, 17L, 17L, 17L, 18L, 18L, 18L, 
18L, 18L, 18L), v = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 
    nrow = 41L, ncol = 20L, dimnames = structure(list(Terms = c("ali also", 
    "ali also delivered", "also", "also called", "also called for", 
    "also contributed", "also contributed to", "also delivered", 
    "also delivered \"a", "also denied", "also denied that", 
    "also nigerian", "also nigerian oil", "also no", "also no projection", 
    "also reviews", "also reviews the", "also was", "also was lowered", 
    "but also", "but also reviews", "european weekend also", 
    "group, also", "group, also called", "he also", "he also denied", 
    "is also", "is also nigerian", "louisiana sweet also", "meeting.\" he also", 
    "private group, also", "sector, but also", "sheikh ali also", 
    "sweet also", "sweet also was", "there was also", "was also", 
    "was also no", "weekend also", "weekend also contributed", 
    "who is also"), Docs = c("127", "144", "191", "194", "211", 
    "236", "237", "242", "246", "248", "273", "349", "352", "353", 
    "368", "489", "502", "543", "704", "708")), .Names = c("Terms", 
    "Docs"))), .Names = c("i", "j", "v", "nrow", "ncol", "dimnames"
), class = c("TermDocumentMatrix", "simple_triplet_matrix"), weighting = c("term frequency", 
"tf"))