Extracting tables from a PDF in R

Time: 2018-09-05 13:49:36

Tags: r pdf

I need to extract tables from a PDF. Here is the link:

https://ainfo.cnptia.embrapa.br/digital/bitstream/item/155505/1/doc-202-1.pdf

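I want to extract the tables on pages 15 to 21. All of these tables have the same structure (18 columns) and the same header.

[Image: snapshot of a single table]

In each table I am only interested in columns 6 to 8 and 17: Ciclo, Graus Dias/dias, Epcaja de Plantion and Regiao de adaptacao.

This is what I did: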

library(dplyr)
library(tabulizer)

out <- extract_tables("mydocument.pdf", pages = 15:21)

# this gives me a list of 7 tables. 

temp <- data.frame(out[[1]]) # taking the first table as an example
temp %>% dplyr::select(X3, X4, X5, X12) # these are the columns corresponding to `Ciclo`, `Graus Dias/dias`, `Epcaja de Plantion` and `Regiao de adaptacao`

[Image: snapshot of the first table]

However, when I extract the 7th table:

  temp <- data.frame(out[[7]])

# Columns 1:4 are merged into a single column.


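In short, extract_tables does not keep column positions consistent across tables, and in some tables it merges columns. How can I fix this so that I end up with one combined table, in a single CSV file, containing the columns Ciclo, Graus Dias/dias, Epcaja de Plantion and Regiao de adaptacao?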

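1 answer:

Answer 0 (score: 0):

In my experience this is a data preparation and wrangling problem rather than a parsing problem, because tabulizer's parsing algorithm leaves little room for adjustment here beyond switching between methods. From what I can see when I try to extract the tables, the 7th table is not the only one that is parsed incorrectly. Every page is parsed differently, but all of the data appears to be retained. I can see that your first table has 13 columns, the second 17, the third 12, the fourth 10, and the last three 11 columns each. What I suggest is to parse each page separately, clean the data on each page according to the output you need, and then bind the pages together. This is a lengthy process and very specific to each parsed table, so I only provide an example script: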

library(dplyr)
library(tidyr)
library(tabulizer)
# Parse each page separately and keep one data.frame per page
pages <- 15:21
result <- lapply(pages, function(p) {
  tables <- extract_tables("mydocument.pdf", pages = p, method = "stream")
  # Assumes a single table per page; data.frame() names the columns X1, X2, ...
  data.frame(tables[[1]], stringsAsFactors = FALSE)
})

## ------- DATA CLEANING OPERATIONS examples:
# Drop the first 3 rows of the first page's table: they are not part of the data
result[[1]] <- result[[1]][-(1:3), ]
# Split or merge columns as each page requires.
# For instance, to split the merged column X1 into 4 (as in your original post), split it on whitespace
result[[1]] <- separate(result[[1]], 1, into = c("X1.1", "X1.2", "X1.3", "X1.4"), sep = " ", remove = TRUE)

## ---- After data cleaning operations:
# Bind all data frames into one (by now they should have the same number of columns and matching column names)
df <- bind_rows(result)
# Write the output CSV file, without the row-name column
write.csv(df, "yourfilename.csv", row.names = FALSE)
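
Because the cleaning differs from page to page, one way to organize it is to record, for each page, which column positions hold the four fields of interest, and then bind only those. The following is a minimal sketch in base R under stated assumptions: `page_a` and `page_b` are made-up stand-ins for two parsed pages (not values from the PDF), `split_first_column` and `pick_columns` are hypothetical helpers, and the short column names and positions are illustrative only.

```r
# Made-up stand-ins for two parsed pages (NOT data from the PDF).
# page_a has a spurious empty column; page_b has its first two fields merged.
page_a <- data.frame(X1 = c("A1", "A2"), X2 = c("Precoce", "Normal"),
                     X3 = c("850", "900"), X4 = c("", ""),
                     X5 = c("Out-Nov", "Set-Out"), X6 = c("Sul", "Centro"),
                     stringsAsFactors = FALSE)
page_b <- data.frame(X1 = c("B1 Precoce", "B2 Normal"), X2 = c("870", "910"),
                     X3 = c("Out-Nov", "Out-Nov"), X4 = c("Sul", "Sul"),
                     stringsAsFactors = FALSE)

# Hypothetical helper: split the first column on whitespace into
# length(into) new columns, keeping the remaining columns after them.
split_first_column <- function(df, into) {
  parts <- strsplit(df[[1]], "\\s+")
  # Stop early if any row does not split into the expected number of pieces
  stopifnot(all(lengths(parts) == length(into)))
  pieces <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
  names(pieces) <- into
  data.frame(pieces, df[, -1, drop = FALSE],
             check.names = FALSE, stringsAsFactors = FALSE)
}

# Hypothetical helper: keep the columns at `positions` and rename them.
pick_columns <- function(df, positions, new_names) {
  out <- df[, positions, drop = FALSE]
  names(out) <- new_names
  out
}

# Illustrative short names for the four fields of interest
wanted <- c("ciclo", "graus_dias", "plantio", "regiao")

page_b <- split_first_column(page_b, into = c("X1.1", "X1.2"))
cleaned <- list(
  pick_columns(page_a, c(2, 3, 5, 6), wanted),  # positions assumed for page_a
  pick_columns(page_b, 2:5, wanted)             # positions assumed for page_b
)
combined <- do.call(rbind, cleaned)
```

With real data, the positions for each page would have to be read off the parsed output by eye, and `combined` could then be passed to `write.csv`.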

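You may also want to look at tabulizer's different parsing methods. I set it to "stream" here because in my experience it usually gives the best results, but "lattice" may work better for some tables.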
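
To judge which method suits which page, it can help to compare how many columns each method produces per page. The following is a rough sketch, not a tested recipe for this PDF: `compare_methods` is a hypothetical helper, and `fake_extract` is a stand-in with made-up column counts so that the example runs without tabulizer or the file. It assumes the first table found on each page is the one of interest.

```r
# Hypothetical helper: column count of the first table on each page,
# once per parsing method. `extract` defaults to tabulizer's extract_tables
# but can be swapped for any function with the same arguments.
compare_methods <- function(file, pages, extract = tabulizer::extract_tables) {
  count_cols <- function(method) {
    vapply(pages, function(p) {
      tables <- extract(file, pages = p, method = method)
      # NA when nothing was detected on the page
      if (length(tables) == 0) NA_integer_ else ncol(tables[[1]])
    }, integer(1))
  }
  data.frame(page = pages,
             stream = count_cols("stream"),
             lattice = count_cols("lattice"))
}

# Stand-in extractor with made-up column counts (NOT real results)
fake_extract <- function(file, pages, method) {
  list(matrix("", nrow = 2, ncol = if (method == "stream") 13L else 18L))
}
counts <- compare_methods("mydocument.pdf", 15:16, extract = fake_extract)

# With tabulizer installed and the real file at hand, the call would be:
# compare_methods("mydocument.pdf", 15:21)
```

Pages where one method's column count is closer to the expected 18 are candidates for parsing with that method, although the count alone does not guarantee that the cells landed in the right columns.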