I am trying to tabulate the following data:
Input:
Big Fat Apple 3
Small Fat Apple 2
Little Small Pear 1
Expected output:
Big = 3
Fat = 3+2=5
Apple = 3+2=5
Small = 2+1=3
Little = 1
Pear = 1
I have tried treating it as a corpus to get a document-term matrix, but I cannot find a way to make "Big Fat Apple 3" actually appear in the corpus as "Big Fat Apple Big Fat Apple Big Fat Apple".
Is there a way to produce such a table? Ideally, I would like to get it in the form of a document-term matrix so that I can use other functions on it.
Answer 0 (score: 1)
To turn a data frame like this into a corpus, you have to tell it explicitly how many times each text occurs, using rep():
d <- data.frame(
  text = c("Big Fat Apple",
           "Small Fat Apple",
           "Little Small Pear"),
  n = c(3, 2, 1),
  stringsAsFactors = FALSE)

library(tm)
# repeat each text n times, then build the corpus from the repeated texts
corpus <- Corpus(VectorSource(rep(d$text, d$n)))
dtm <- DocumentTermMatrix(corpus)
You can then compute the term frequencies from the DTM (see How to find term frequency within a DTM in R?).
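A minimal sketch of that step, assuming the dtm built above (for large matrices, slam::col_sums(dtm) avoids the dense conversion). Note that tm lowercases terms by default, so the counts come back in lowercase:

# total count of each term across all (repeated) documents
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
freq
## apple    fat    big  small little   pear
##     5      5      3      3      1      1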
Answer 1 (score: 1)
Using the sample data from @scoa's answer, you can try cSplit from my "splitstackshape" package, like this:
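The code block itself was lost when the page was converted; the following is a minimal sketch of what the cSplit approach could look like, assuming the data frame d from the answer above (cSplit returns a data.table, so the aggregation uses data.table syntax):

library(splitstackshape)
library(data.table)
# split each text into one row per word, carrying its count n along
long <- cSplit(d, "text", sep = " ", direction = "long")
# sum the counts for each word
long[, list(total = sum(n)), by = text]
##      text total
## 1:    Big     3
## 2:    Fat     5
## 3:  Apple     5
## 4:  Small     3
## 5: Little     1
## 6:   Pear     1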
Answer 2 (score: 1)
May I suggest the quanteda package (quantitative analysis of textual data). You can get what you want either by tokenizing and tabulating, or by creating a document-feature matrix (here, with a single document):
cat("Big Fat Apple 3
Small Fat Apple 2
Little Small Pear 1\n", file = "example.txt")
mydata <- read.table("example.txt", stringsAsFactors = FALSE)
mydata <- paste(with(mydata, rep(paste(V1, V2, V3), V4)), collapse = " ")
mydata
## [1] "Big Fat Apple Big Fat Apple Big Fat Apple Small Fat Apple Small Fat Apple Little Small Pear"
# use the quanteda package as an alternative to tm
install.packages("quanteda")
library(quanteda)
# can simply tokenize and tabulate
table(tokenize(mydata))
## apple big fat little pear small
## 5 3 5 1 1 3
# alternatively, can create a one-document document-term matrix
myDfm <- dfm(mydata)
## Creating a dfm from a character vector ...
## ... indexing 1 document
## ... tokenizing texts, found 18 total tokens
## ... cleaning the tokens, 0 removed entirely
## ... summing tokens by document
## ... indexing 6 feature types
## ... building sparse matrix
## ... created a 1 x 6 sparse dfm
## ... complete. Elapsed time: 0.011 seconds.
myDfm
## Document-feature matrix of: 1 document, 6 features.
## 1 x 6 sparse Matrix of class "dfmSparse"
## features
## docs apple big fat little pear small
## text1 5 3 5 1 1 3
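If you want those counts back as a plain named vector (matching the "word = count" table in the question), quanteda's topfeatures() is one option, assuming the same quanteda version as above:

topfeatures(myDfm)
## apple    fat    big  small little   pear
##     5      5      3      3      1      1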
Happy to answer any questions about quanteda, as we are actively seeking to improve it.