Question

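I run into a problem with findAssocs when the text comes from a PDF converted with pdf_text from the pdftools package.

I have found the cause. I cannot use "readLines", and Corpus creates a separate document for each page of the PDF. As a result, when I call findAssocs it returns 1, because the text sits on two pages.

Is there a workaround? For reference, my code is below.

Thanks in advance :).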

library(pdftools)  # pdf_text()
library(tm)        # Corpus(), tm_map(), TermDocumentMatrix(), findAssocs()

# pdf_text() returns a character vector with one element per page
text <- pdf_text(file.choose())
docs <- Corpus(VectorSource(text))
inspect(docs)

# Replace a given pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("dutch"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)

# Term frequencies, most frequent first
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)

# input$v is defined elsewhere (not shown in the question)
as.data.frame(findAssocs(dtm, terms = input$v, corlimit = 0.3))

Answer 1

If you want to merge all pages loaded by pdf_text into a single field, you can use paste before converting the text to a corpus.

paste(unlist(text), collapse =" ")
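
To make the effect of that line concrete, here is a minimal sketch in base R. The `text` vector below is a made-up stand-in for what pdf_text returns (one string per page), and the names `merged` and `page_lines` are hypothetical. The second half is an assumption that goes beyond the answer: it splits on newlines instead, which gives one element per line, roughly what readLines would produce for a plain text file.

```r
# Hypothetical stand-in for pdf_text() output: one string per page.
text <- c("first page line one\nfirst page line two",
          "second page line one\nsecond page line two")

# Merge all pages into a single string, as suggested above.
merged <- paste(unlist(text), collapse = " ")
length(merged)

# Assumption, not part of the answer: one element per line instead,
# so the corpus would hold one document per line rather than per page.
page_lines <- unlist(strsplit(text, "\n", fixed = TRUE))
length(page_lines)

# Either vector can then be passed on as in the question, e.g.
# docs <- Corpus(VectorSource(merged))
```

Which variant fits depends on what counts as a document for the analysis. findAssocs correlates term counts across documents, so the number of documents in the corpus directly affects the values it reports.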

