Question

I am using the R programming language. I am trying to learn how to summarize text articles by following this post: https://www.hvitfeldt.me/blog/tidy-text-summarization-using-textrank/

Following the instructions, I copied the code from the website (I used a random PDF that I found online):

library(tidyverse)
## Warning: package 'tibble' was built under R version 3.6.2
library(tidytext)
library(textrank)
library(rvest)
## Warning: package 'xml2' was built under R version 3.6.2

url <- "https://shakespeare.folger.edu/downloads/pdf/hamlet_PDF_FolgerShakespeare.pdf"


article <- read_html(url) %>%
  html_nodes('div[class="padded"]') %>%
  html_text()


article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)


article_words <- article_sentences %>%
  unnest_tokens(word, sentence)


article_words <- article_words %>%
  anti_join(stop_words, by = "word")

Everything runs without errors up to this point.

The following part is where the problem occurs:

article_summary <- textrank_sentences(data = article_sentences,
                                      terminology = article_words)

Error in textrank_sentences(data = article_sentences, terminology = article_words) : 
  nrow(data) > 1 is not TRUE

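Can someone tell me what I am doing wrong? Does the above procedure not work for PDF files?

Would this be a possible solution? What if I copy and paste the entire text of the PDF, assign it to the "article" object, and then run the rest of the code?

For example: article <- "blah blah blah ..... blah blah blah"

Thanks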

Answer 1

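The link you shared reads data from a web page. The selector div[class="padded"] is specific to the web page that post reads from. It will not work for any other web page, or for the PDF you are trying to read. When the selector matches nothing, article ends up empty, so article_sentences has fewer than two rows and textrank_sentences() stops on its nrow(data) > 1 check. You can use the pdftools package to read text from a PDF.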

library(pdftools)
library(dplyr)     # %>%, mutate(), select(), row_number(), anti_join()
library(tibble)    # tibble()
library(tidytext)
library(textrank)

url <- "https://shakespeare.folger.edu/downloads/pdf/hamlet_PDF_FolgerShakespeare.pdf"

article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)


article_words <- article_sentences %>%
  unnest_tokens(word, sentence)


article_words <- article_words %>%
  anti_join(stop_words, by = "word")

article_summary <- textrank_sentences(data = article_sentences, terminology = article_words)
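
Before calling textrank_sentences(), it can help to confirm that the sentence table actually has more than one row, since that is the condition the error message reports. The following is only a minimal sketch under stated assumptions: the pages vector is a made-up stand-in for the output of pdf_text() (which returns one string per page), joining the pages with paste() is an optional step I am assuming is acceptable for your text, and has_enough_sentences() is a hypothetical helper, not part of textrank.

```r
library(tibble)
library(dplyr)
library(tidytext)

# Hypothetical stand-in for pdf_text() output: one string per page.
pages <- c(
  "To be, or not to be. That is the",
  "question. The rest is silence."
)

# Optional (assumption): join the pages so that a sentence crossing a
# page break is not cut in two.
article <- paste(pages, collapse = " ")

# Same sentence-splitting steps as in the code above.
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)

# Hypothetical helper that mirrors the nrow(data) > 1 condition
# named in the error message.
has_enough_sentences <- function(data) nrow(data) > 1

# Fail early with a readable message instead of the assertion error.
if (!has_enough_sentences(article_sentences)) {
  stop("Fewer than two sentences were extracted; check how the text was read.")
}
```

If this check fails on your own data, inspect article itself first (for example with length(article) and nchar(article)), because an empty article is the likely reason the sentence table is empty.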
