Question

I am trying to tabulate the following data:

Input

Big Fat Apple          3
Small Fat Apple        2
Little Small Pear      1

Expected output:

Big = 3
Fat = 3+2=5
Apple = 3+2=5
Small = 2+1=3
Little = 1
Pear = 1

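I tried building a document-term matrix by treating the data as a corpus, but I could not find a way to make "Big Fat Apple" actually appear in the corpus three times, that is, as "Big Fat Apple Big Fat Apple Big Fat Apple".

Is there a way to produce a table like this? Ideally I would like the result in the form of a document-term matrix, so that I can use other functions on it.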

Answer 1

To convert a data frame like this into a corpus, you have to tell it explicitly that each text should be repeated n times, using rep():

d <- data.frame(
  text=c("Big Fat Apple",
         "Small Fat Apple",
         "Little Small Pear"),
  n = c(3,2,1),stringsAsFactors=FALSE)

library(tm)
corpus <- Corpus(VectorSource(rep(d$text,d$n)))
dtm <- DocumentTermMatrix(corpus)

You can then compute the term frequencies (see How to find term frequency within a DTM in R?).
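As a minimal sketch of that last step, the column sums of the matrix give one total per term. This assumes the tm package is installed and that the matrix is small enough to convert with as.matrix(); the names freq and check are illustrative. Note that tm lower-cases terms by default, so the result is keyed by "apple", "big", and so on. The base R lines at the end are an independent cross-check that weights each word by n without repeating any text.

```r
library(tm)

# Sample data from the answer above
d <- data.frame(
  text = c("Big Fat Apple", "Small Fat Apple", "Little Small Pear"),
  n = c(3, 2, 1),
  stringsAsFactors = FALSE
)

# Repeat each text n times, then build the document-term matrix
corpus <- Corpus(VectorSource(rep(d$text, d$n)))
dtm <- DocumentTermMatrix(corpus)

# One total per term: sum each column of the (dense) matrix
freq <- colSums(as.matrix(dtm))
freq

# Cross-check in base R, no corpus needed:
# split each text into words and weight every word by its row's n
words <- strsplit(d$text, " ", fixed = TRUE)
check <- tapply(rep(d$n, lengths(words)), unlist(words), sum)
check
```

For a large corpus, converting to a dense matrix may use a lot of memory; the base R variant avoids that because it never expands the texts.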

Answer 2

Using the sample data from @scoa's answer, you can try cSplit from my "splitstackshape" package, like this:

cSplit
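A minimal sketch of one way cSplit could be applied here, assuming the data frame d from Answer 1 and that splitstackshape and data.table are installed. The aggregation step and the column name total are illustrative choices, not necessarily the author's original code.

```r
library(splitstackshape)
library(data.table)

# Sample data from Answer 1
d <- data.frame(
  text = c("Big Fat Apple", "Small Fat Apple", "Little Small Pear"),
  n = c(3, 2, 1),
  stringsAsFactors = FALSE
)

# Split the text column on spaces into "long" format:
# one row per word, with the row's n carried along
long <- cSplit(d, "text", sep = " ", direction = "long")

# cSplit returns a data.table, so sum n within each word
res <- long[, .(total = sum(n)), by = text]
res
```

This avoids repeating the texts at all: each word simply inherits the count of the row it came from.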

Answer 3

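May I suggest the quanteda package (Quantitative Analysis of Textual Data). You can get what you want either by tokenizing and tabulating, or by creating a document-feature matrix (here, with a single document):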

cat("Big Fat Apple          3
     Small Fat Apple        2
     Little Small Pear      1\n", file = "example.txt")
mydata <- read.table("example.txt", stringsAsFactors = FALSE)
mydata <- paste(with(mydata, rep(paste(V1, V2, V3), V4)), collapse = " ")
mydata
## [1] "Big Fat Apple Big Fat Apple Big Fat Apple Small Fat Apple Small Fat Apple Little Small Pear"

# use the quanteda package as an alternative to tm
install.packages("quanteda")
library(quanteda)
# can simply tokenize and tabulate
table(tokenize(mydata))
## apple    big    fat little   pear  small 
##     5      3      5      1      1      3 

# alternatively, can create a one-document document-term matrix
myDfm <- dfm(mydata)
## Creating a dfm from a character vector ...
## ... indexing 1 document
## ... tokenizing texts, found 18 total tokens
## ... cleaning the tokens, 0 removed entirely
## ... summing tokens by document
## ... indexing 6 feature types
## ... building sparse matrix
## ... created a 1 x 6 sparse dfm
## ... complete. Elapsed time: 0.011 seconds.
myDfm
## Document-feature matrix of: 1 document, 6 features.
## 1 x 6 sparse Matrix of class "dfmSparse"
## features
## docs    apple big fat little pear small
## text1     5   3   5      1    1     3
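The session above reflects the quanteda API at the time of writing. In later releases, tokenize() was replaced by tokens(), and dfm() expects a tokens object rather than a character vector. A rough equivalent for a recent version might look like the sketch below, assuming quanteda 2.0 or later (for featfreq()); the names toks and freq are illustrative.

```r
library(quanteda)

# Same single "document" as above, built without the intermediate file
mydata <- paste(
  rep(c("Big Fat Apple", "Small Fat Apple", "Little Small Pear"), c(3, 2, 1)),
  collapse = " "
)

# Tokenize first, then build the document-feature matrix;
# dfm() lower-cases features by default
toks <- tokens(mydata)
myDfm <- dfm(toks)

# Total frequency of each feature across all documents
freq <- featfreq(myDfm)
freq
```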

Happy to help with any questions about quanteda, as we are actively looking to improve it.

How to tabulate term frequency data, with or without a document-term matrix?
