From PDF text to a tidy data frame (with file names in the document column)

Time: 2018-08-16 13:57:26

Tags: r pdf text-mining corpus tidytext

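I want to analyze the text in nearly 300 PDF documents. So far I have used the pdftools, tm and tidytext packages to read the text, convert it to a corpus, then to a document-term matrix, and finally I want to structure it as a tidy data frame.

I have a few questions:

  • How do I get rid of the page data (at the top and/or bottom of each PDF page)?
  • I would like the file names as the values in the document column instead of index numbers.
  • The code below uses only 2 PDF files for reproducibility. When I run it on all files, my corpus object contains 294 documents, but when I tidy it I seem to lose some files, because converted %>% distinct(document) returns 275. I don't know why this happens.

I have the following reproducible script: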

library(tidyverse)
library(tidytext)
library(pdftools)
library(tm)
library(broom)

# Create a temporary empty directory 
# (don't worry at the end of this script I'll remove this directory and its files)

dir.create("~/Desktop/sample-pdfs")

# Fill directory with 2 pdf files from my github repo

download.file("https://github.com/thomasdebeus/colourful-facts/raw/master/projects/sample-data/'s-Gravenhage_coalitieakkoord.pdf", destfile = "~/Desktop/sample-pdfs/'s-Gravenhage_coalitieakkoord.pdf")
download.file("https://github.com/thomasdebeus/colourful-facts/raw/master/projects/sample-data/Aa%20en%20Hunze_coalitieakkoord.pdf", destfile = "~/Desktop/sample-pdfs/Aa en Hunze_coalitieakkoord.pdf")

# Create vector of file paths

dir <- "~/Desktop/sample-pdfs"
pdfs <- paste(dir, "/", list.files(dir, pattern = "*.pdf"), sep = "")

# Read the text from pdf's with pdftools package

pdfs_text <- map(pdfs, pdf_text)

# Convert to document-term-matrix

converted <- Corpus(VectorSource(pdfs_text)) %>%
          DocumentTermMatrix()

# Now I want to convert this to a tidy format

converted %>%
          tidy() %>%
          filter(!grepl("[0-9]+", term))

This gives the following output:

# A tibble: 5,305 x 3
   document term           count
   <chr>    <chr>          <dbl>
 1 1        aan              158
 2 1        aanbesteding       2
 3 1        aanbestedingen     1
 4 1        aanbevelingen      1
 5 1        aanbieden          3
 6 1        aanbieders         1
 7 1        aanbod             8
 8 1        aandacht          16
 9 1        aandachtspunt      3
10 1        aandeel            1
# ... with 5,295 more rows

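This looks fine, but I would like the file names ("'s-Gravenhage", "Aa en Hunze") as the values in the document column instead of index numbers. How can I do that?

Desired output: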

# A tibble: 5,305 x 3
   document      term           count
   <chr>         <chr>          <dbl>
 1 's-Gravenhage aan              158
 2 's-Gravenhage aanbesteding       2
 3 's-Gravenhage aanbestedingen     1
 4 's-Gravenhage aanbevelingen      1
 5 's-Gravenhage aanbieden          3
 6 's-Gravenhage aanbieders         1
 7 's-Gravenhage aanbod             8
 8 's-Gravenhage aandacht          16
 9 's-Gravenhage aandachtspunt      3
10 's-Gravenhage aandeel            1
# ... with 5,295 more rows

Remove the downloaded files and their directory from the desktop by running the following line:

unlink("~/Desktop/sample-pdfs", recursive = TRUE)

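Any help is much appreciated!

4 answers:

Answer 0 (score: 2)

You can read the documents directly into a corpus with tm. The reader readPDF uses pdftools as its engine, so there is no need to first create a set of texts and then put it into a corpus to get the output. I created 2 examples. The first is in line with what you did, but goes through a corpus first. The second is based purely on tidyverse + tidytext, with no need to switch between tm, tidytext, and so on.

The difference in the number of tokens between the examples is due to the automatic cleaning in tidytext / tokenizers.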

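If you have a lot of documents to process, you may want to use quanteda as your main workhorse, since it works on multiple cores out of the box and can speed up the tokenizer step. Don't forget to use the stopwords package to get a good list of Dutch stop words. If you need POS tagging of Dutch words, check out the udpipe package.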
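As an illustration of the stop word suggestion, here is a minimal sketch that drops Dutch stop words from a tidy term table. The `tidy_terms` tibble is made-up stand-in data shaped like the output of `tidy()`, and the Snowball source is only one of the lists the stopwords package offers; it is picked here as an example, not as a recommendation from this answer.

```r
library(dplyr)
library(tibble)
library(stopwords)

# Made-up stand-in for the output of tidy() on a document-term matrix
tidy_terms <- tibble(
  document = c("a.pdf", "a.pdf", "a.pdf", "b.pdf"),
  term     = c("de", "onderwijs", "het", "aanbod"),
  count    = c(11, 3, 5, 8)
)

# Dutch stop words from the Snowball source, as a one-column tibble
dutch_stopwords <- tibble(term = stopwords::stopwords("nl", source = "snowball"))

# Keep only the rows whose term is not a stop word
filtered_terms <- tidy_terms %>%
  anti_join(dutch_stopwords, by = "term")

filtered_terms
```

The same `anti_join()` step could be appended to the end of a `tidy()` pipeline, provided the term column has the same name on both sides.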

Going through a tm corpus, then tidying with tidytext:

library(tidyverse)
library(tidytext)
library(tm)

directory <- "D:/sample-pdfs"

# create corpus from pdfs
converted <- VCorpus(DirSource(directory), readerControl = list(reader = readPDF)) %>% 
  DocumentTermMatrix()


converted %>%
  tidy() %>%
  filter(!grepl("[0-9]+", term))

# A tibble: 5,707 x 3
   document                          term           count
   <chr>                             <chr>          <dbl>
 1 's-Gravenhage_coalitieakkoord.pdf "\ade"             4
 2 's-Gravenhage_coalitieakkoord.pdf "\adeze"           1
 3 's-Gravenhage_coalitieakkoord.pdf "\aeen"            2
 4 's-Gravenhage_coalitieakkoord.pdf "\aer"             2
 5 's-Gravenhage_coalitieakkoord.pdf "\aextra"          2
 6 's-Gravenhage_coalitieakkoord.pdf "\agroei"          1
 7 's-Gravenhage_coalitieakkoord.pdf "\ahet"            1
 8 's-Gravenhage_coalitieakkoord.pdf "\amet"            1
 9 's-Gravenhage_coalitieakkoord.pdf "\aonderwijs,"     1
10 's-Gravenhage_coalitieakkoord.pdf "\aop"            11
# ... with 5,697 more rows
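For comparison, a route that stays within tidyverse + tidytext and skips tm might look like the sketch below. It is an illustration, not the code originally posted. `pages_by_file` is a made-up stand-in for what `map(pdfs, pdftools::pdf_text)` would return (one character vector of pages per file), named by file so that the names end up in the document column.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(tidytext)

# Made-up stand-in for map(pdfs, pdftools::pdf_text), named by file name
pages_by_file <- list(
  "a.pdf" = c("de kat zit op de mat", "pagina 2 de hond"),
  "b.pdf" = c("een hond en een kat")
)

tidy_counts <- tibble(
  document = names(pages_by_file),
  text     = unname(pages_by_file)   # list column: one vector of pages per file
) %>%
  unnest(cols = text) %>%                              # one row per page
  unnest_tokens(word, text, strip_numeric = TRUE) %>%  # lowercase words, numbers dropped
  count(document, word, name = "count")                # term counts per document

tidy_counts
```

Because `unnest_tokens()` lowercases and strips punctuation by default, the token counts from this route will generally differ from those of the tm route above.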

Answer 1 (score: 1)

I suggest writing a wrapper function for what you want to do, so that each file name can be added as a column.

read_PDF <- function(file){

    pdfs_text <- pdf_text(file)
    converted <- Corpus(VectorSource(pdfs_text)) %>%
          DocumentTermMatrix()
    converted %>%
          tidy() %>%
          filter(!grepl("[0-9]+", term)) %>%

          # add FileName as a column
          mutate(FileName = file)
}

final <- map(pdfs, read_PDF) %>% data.table::rbindlist()
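A wrapper like this is also a convenient place to deal with the page headers and footers mentioned in the question. The following is only a sketch under a strong assumption: that the header and footer each occupy a fixed number of lines at the top and bottom of every page. `strip_page_margins` is a hypothetical helper name, and the sample page is made up.

```r
# Hypothetical helper: drop the first `n_top` and last `n_bottom` lines of one page.
# Assumes headers/footers take up a fixed number of lines, which may not hold
# for every PDF.
strip_page_margins <- function(page, n_top = 1, n_bottom = 1) {
  lines <- strsplit(page, "\n", fixed = TRUE)[[1]]
  keep <- seq_along(lines)
  keep <- keep[keep > n_top & keep <= length(lines) - n_bottom]
  paste(lines[keep], collapse = "\n")
}

# Made-up page text with a header line and a footer line
page <- "Coalitieakkoord 2018\nEerste regel\nTweede regel\nPagina 3"
cleaned <- strip_page_margins(page)
cleaned
```

Inside `read_PDF`, the pages could be cleaned before building the corpus, for instance with `vapply(pdf_text(file), strip_page_margins, character(1))`. Pages shorter than `n_top + n_bottom` lines come back as empty strings.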

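Answer 2 (score: 1)

Nice example!

  • I added a few lines to attach the names.
  • I am not sure about the missing files; I did not get that behaviour.
  • Just to mention: your file names are not very standard. I suggest checking them again; the first file also has an apostrophe at the start. I would also suggest cleaning up the spaces.
  • I tested with English documents; you can add other languages in the corpus.

The code is as follows: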

library(tidyverse)
library(tidytext)
library(pdftools) 
library(tm)
library(broom)

# Directory that holds the PDFs

dir <- "PDFs/"
pdfs <- paste0(dir, list.files(dir, pattern = "\\.pdf$"))
names <- list.files(dir, pattern = "\\.pdf$")

# create a table of names (file name without extension, plus the index tm will assign)
namesDocs <- 
    names %>% 
    str_remove(pattern = "\\.pdf$") %>% 
    as_tibble() %>% 
    mutate(ids = as.character(seq_along(names)))

namesDocs
# Read the text from pdf's with pdftools package

pdfs_text <- map(pdfs, pdftools::pdf_text)

# Convert to document-term-matrix
# add cleaning process

converted <-
    Corpus(VectorSource(pdfs_text)) %>%
    DocumentTermMatrix(
        control = list(removeNumbers = TRUE,
                       stopwords = TRUE,
                       removePunctuation = TRUE))

converted
# Now I want to convert this to a tidy format
# add names of documents

mytable <-
  converted %>%
  tidy() %>%
  arrange(desc(count)) %>% 
  left_join(y = namesDocs, by = c("document" = "ids"))

head(mytable)

View(mytable)
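On the missing files: `tidy()` on a document-term matrix returns only the non-zero entries, so a document with no terms at all contributes no rows and disappears from `distinct(document)`. One possible cause (an assumption, not verified against the asker's files) is PDFs with no extractable text, such as scanned documents. A minimal sketch for listing the documents that dropped out, using made-up stand-ins for `namesDocs` and the tidied table:

```r
library(dplyr)
library(tibble)

# Made-up stand-ins: a lookup table like namesDocs and a tidied term table
namesDocs <- tibble(value = c("doc_a", "doc_b", "doc_c"),
                    ids   = c("1", "2", "3"))
tidy_terms <- tibble(document = c("1", "1", "3"),
                     term     = c("aan", "aanbod", "aan"),
                     count    = c(2, 1, 4))

# Documents present in the lookup table but absent from the tidy output
missing_docs <- namesDocs %>%
  anti_join(distinct(tidy_terms, document), by = c("ids" = "document"))

missing_docs
```

The files listed this way can then be opened by hand to check whether they contain any text layer.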

Answer 3 (score: 0)

I think the simplest approach I found online is the one from Julien Brun's Text mining material.

You need two packages:

library("readtext")
library("quanteda")

For this code, name your PDFs as Author_date and place them in a folder inside your working directory; for example, I put my PDFs in the PDFs folder.

# set path to the PDFs
pdf_path <- "PDFs/"

# List the PDFs 
pdfs <- list.files(path = pdf_path, pattern = 'pdf$',  full.names = TRUE) 

# Import the PDFs into R
spill_texts <- readtext(pdfs, 
                        docvarsfrom = "filenames", 
                        sep = "_", 
                        docvarnames = c("First_author", "Year"))

# Transform the pdfs into a corpus object
spill_corpus  <- corpus(spill_texts)
spill_corpus
# Some stats about the pdfs
tokenInfo <- summary(spill_corpus)
tokenInfo
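To get from a quanteda corpus to the tidy document / term / count layout asked for in the question, one option is to build a document-feature matrix and tidy it with tidytext. This is a sketch that goes beyond the code above: `texts` is a made-up stand-in for what `readtext()` would import, and the cleaning options are chosen only for illustration.

```r
library(quanteda)
library(tidytext)

# Made-up stand-in for the texts imported by readtext(), named by file
texts <- c("Smith_2010.pdf" = "Oil spills harm coastal birds.",
           "Jones_2015.pdf" = "Birds recover after oil spills.")

# Tokenize, dropping punctuation and numbers, then build a document-feature matrix
toks <- tokens(corpus(texts), remove_punct = TRUE, remove_numbers = TRUE)
spill_dfm <- dfm(toks)

# One row per document-term pair; the document column holds the document names
tidy_terms <- tidy(spill_dfm)
tidy_terms
```

Since the corpus keeps the document names, the document column carries the file names directly and no separate lookup table is needed.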