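I am using the R programming language. I am trying to learn how to summarize text articles by following this website: https://www.hvitfeldt.me/blog/tidy-text-summarization-using-textrank/
Following the instructions, I copied the code from the website (using a random PDF I found online):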
library(tidyverse)
## Warning: package 'tibble' was built under R version 3.6.2
library(tidytext)
library(textrank)
library(rvest)
## Warning: package 'xml2' was built under R version 3.6.2
url <- "https://shakespeare.folger.edu/downloads/pdf/hamlet_PDF_FolgerShakespeare.pdf"
article <- read_html(url) %>%
  html_nodes('div[class="padded"]') %>%
  html_text()
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)
article_words <- article_sentences %>%
  unnest_tokens(word, sentence)
article_words <- article_words %>%
  anti_join(stop_words, by = "word")
Everything runs without errors up to this point.
The following part is where the problem occurs:
article_summary <- textrank_sentences(data = article_sentences,
terminology = article_words)
Error in textrank_sentences(data = article_sentences, terminology = article_words) :
nrow(data) > 1 is not TRUE
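Can someone tell me what I am doing wrong? Does the above procedure not work for PDF files?
Would this be a possible solution: what if I copy and paste the entire text of the PDF, assign it to the "article" object, and then run the rest of the code?
For example: article <- "blah blah blah ..... blah blah blah"
Thanks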
Answer 0 (score: 1)
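The link you shared reads its data from a web page. The selector div[class="padded"] is specific to the web page they were reading. It will not work for any other web page, or for the PDF you are trying to read. You can use the pdftools package to read data from a PDF.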
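To see why the original code stops with "nrow(data) > 1 is not TRUE", consider what presumably happens when the selector matches nothing: html_text() yields an empty character vector, so the sentence table ends up with zero rows. The sketch below reproduces that condition in base R without network access. The empty vector is an assumed stand-in for the scraped result, not the asker's actual output.

```r
# Stand-in for the result of html_nodes(...) %>% html_text() when the
# CSS selector matches no node (assumed, not the asker's actual output).
article <- character(0)

# A table built from an empty vector has zero rows.
article_sentences <- data.frame(text = article, stringsAsFactors = FALSE)

# textrank_sentences() needs more than one sentence; this mirrors the
# condition named in the error message.
enough_sentences <- nrow(article_sentences) > 1

print(nrow(article_sentences))
print(enough_sentences)
```

Checking nrow(article_sentences) right after the scraping step is a quick way to catch this before calling textrank_sentences().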
library(pdftools)
library(dplyr)    # %>%, mutate(), select(), anti_join(), row_number()
library(tibble)   # tibble()
library(tidytext)
library(textrank)
url <- "https://shakespeare.folger.edu/downloads/pdf/hamlet_PDF_FolgerShakespeare.pdf"
# pdf_text() returns a character vector with one element per page
article <- pdf_text(url)
# Split the text into sentences and number them
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)
# Split the sentences into words and drop stop words
article_words <- article_sentences %>%
  unnest_tokens(word, sentence) %>%
  anti_join(stop_words, by = "word")
article_summary <- textrank_sentences(data = article_sentences, terminology = article_words)
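As for the copy/paste idea from the question: assigning the text to article as a plain string should also work, provided it splits into more than one sentence. Here is a minimal sketch with a short made-up text (the sample sentences are illustrative and not taken from the PDF). It also shows one way to pull the top-ranked sentences out of the result with summary().

```r
library(dplyr)
library(tibble)
library(tidytext)
library(textrank)

# Illustrative text pasted in by hand (hypothetical sample, not the PDF).
article <- paste(
  "Hamlet is a prince of Denmark.",
  "The ghost of the king visits Hamlet.",
  "The ghost tells Hamlet that the king was murdered.",
  "Hamlet plans revenge for the murdered king."
)

# One row per sentence, with a numeric id as the first column.
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)

# One row per (sentence, word), with stop words removed.
article_words <- article_sentences %>%
  unnest_tokens(word, sentence) %>%
  anti_join(stop_words, by = "word")

article_summary <- textrank_sentences(data = article_sentences,
                                      terminology = article_words)

# The two highest-ranked sentences, kept in their original order.
top_sentences <- summary(article_summary, n = 2, keep.sentence.order = TRUE)
print(top_sentences)
```

With a full play the default settings compare every pair of sentences, so the ranking step may take noticeably longer than it does on this toy input.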