How can I tabulate term frequency data, with or without a document-term matrix?

Time: 2015-04-22 09:00:44

Tags: r nlp nltk text-processing text-analysis

I am trying to tabulate the following data:

Input:

Big Fat Apple          3
Small Fat Apple        2
Little Small Pear      1

Expected output:

Big = 3
Fat = 3+2=5
Apple = 3+2=5
Small = 2+1=3
Little = 1
Pear = 1

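I tried to build a document-term matrix by treating the data as a corpus, but I could not find a way to make "Big Fat Apple" actually appear in the corpus three times, as in "Big Fat Apple Big Fat Apple Big Fat Apple".

Is there a way to produce a table like this? Ideally, I would like the result in the form of a document-term matrix so that I can use other functions on it.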

3 answers:

Answer 0: (score: 1)

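To convert a data frame like this into a corpus, you have to tell it explicitly that each text should be repeated the given number of times, using rep():
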
d <- data.frame(
  text=c("Big Fat Apple",
         "Small Fat Apple",
         "Little Small Pear"),
  n = c(3,2,1),stringsAsFactors=FALSE)

library(tm)
corpus <- Corpus(VectorSource(rep(d$text,d$n)))
dtm <- DocumentTermMatrix(corpus)

You can then compute the term frequencies (see "How to find term frequency within a DTM in R?").
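
As a minimal sketch of that last step, assuming the tm package is installed and the default DocumentTermMatrix settings (which lowercase the terms), the per-term totals can be read off as column sums. The name `freq` is only illustrative, and as.matrix() is reasonable here only because the example is tiny:

```r
library(tm)

# Sample data from the answer above
d <- data.frame(
  text = c("Big Fat Apple", "Small Fat Apple", "Little Small Pear"),
  n = c(3, 2, 1),
  stringsAsFactors = FALSE
)

# Repeat each text n times so the corpus reflects the counts
corpus <- Corpus(VectorSource(rep(d$text, d$n)))
dtm <- DocumentTermMatrix(corpus)

# Each column of the DTM is a term; summing over documents gives its frequency
freq <- colSums(as.matrix(dtm))
sort(freq, decreasing = TRUE)
```

For a large corpus, converting the sparse matrix to a dense one may use too much memory, so this is best treated as an illustration rather than a general recipe.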

Answer 1: (score: 1)

Using the sample data from @scoa's answer, you can try cSplit from my "splitstackshape" package, like this:

cSplit
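
The general idea of splitting each text into one row per word and then summing the counts can be sketched in base R. This is an illustration under that assumption, not the splitstackshape code itself; `words`, `long` and `term_freq` are hypothetical names chosen for the example:

```r
# Sample data, as in @scoa's answer
d <- data.frame(
  text = c("Big Fat Apple", "Small Fat Apple", "Little Small Pear"),
  n = c(3, 2, 1),
  stringsAsFactors = FALSE
)

# Split each text on spaces: one list element per input row
words <- strsplit(d$text, " ", fixed = TRUE)

# Long format: one row per word, carrying that row's count along
long <- data.frame(
  term = unlist(words),
  n = rep(d$n, lengths(words)),
  stringsAsFactors = FALSE
)

# Sum the counts for each distinct term (case is preserved here)
term_freq <- tapply(long$n, long$term, sum)
term_freq
```

Unlike the document-term matrix route, this never materialises the repeated texts, which may matter when the counts are large.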

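Answer 2: (score: 1)

May I suggest the quanteda package (Quantitative Analysis of Textual Data). You can get what you want either by tokenizing and tabulating, or by creating a document-feature matrix (here, with a single document):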

cat("Big Fat Apple          3
     Small Fat Apple        2
     Little Small Pear      1\n", file = "example.txt")
mydata <- read.table("example.txt", stringsAsFactors = FALSE)
mydata <- paste(with(mydata, rep(paste(V1, V2, V3), V4)), collapse = " ")
mydata
## [1] "Big Fat Apple Big Fat Apple Big Fat Apple Small Fat Apple Small Fat Apple Little Small Pear"

# use the quanteda package as an alternative to tm
install.packages("quanteda")
library(quanteda)
# can simply tokenize and tabulate
table(tokenize(mydata))
## apple    big    fat little   pear  small 
##     5      3      5      1      1      3 

# alternatively, can create a one-document document-term matrix
myDfm <- dfm(mydata)
## Creating a dfm from a character vector ...
## ... indexing 1 document
## ... tokenizing texts, found 18 total tokens
## ... cleaning the tokens, 0 removed entirely
## ... summing tokens by document
## ... indexing 6 feature types
## ... building sparse matrix
## ... created a 1 x 6 sparse dfm
## ... complete. Elapsed time: 0.011 seconds.
myDfm
## Document-feature matrix of: 1 document, 6 features.
## 1 x 6 sparse Matrix of class "dfmSparse"
## features
## docs    apple big fat little pear small
## text1     5   3   5      1    1     3
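
The calls above reflect the quanteda API at the time of writing. In more recent releases, as far as I know, tokenize() has been replaced by tokens() and dfm() expects a tokens object, so a rough equivalent, assuming a current quanteda version, might look like this (`toks` and `freq` are illustrative names):

```r
library(quanteda)

# Build the same single repeated text as above
mydata <- paste(
  rep(c("Big Fat Apple", "Small Fat Apple", "Little Small Pear"), c(3, 2, 1)),
  collapse = " "
)

# Tokenize first, then build the document-feature matrix from the tokens
toks <- tokens(mydata)
myDfm <- dfm(toks)  # features are lowercased by default

# Total frequency of each feature across all documents
freq <- featfreq(myDfm)
freq
```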

Happy to help with any questions about quanteda, as we are actively looking to improve it.