Question

I need to extract tables from a PDF. Here is the link:

https://ainfo.cnptia.embrapa.br/digital/bitstream/item/155505/1/doc-202-1.pdf

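I want to extract the tables on pages 15 to 21. All of these tables have the same structure (18 columns) and the same headers. This is a snapshot of a single table.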

In each table, I am only interested in columns 6-8 and 17: Ciclo, Graus Dias/dias, Epcaja de Plantion and Regiao de adaptacao.

This is what I did:

library(dplyr)
library(tabulizer)

out <- extract_tables("mydocument.pdf", pages = 15:21)

# this gives me a list of 7 tables. 

temp <- data.frame(out[[1]]) # taking the first table as an example
temp %>% dplyr::select(X3, X4, X5, X12) # these are the columns corresponding to `Ciclo`, `Graus Dias/dias`, `Epcaja de Plantion` and `Regiao de adaptacao`

# this is a snapshot of first table

However, when I extract the 7th table:

  temp <- data.frame(out[[7]])

# Columns 1 to 4 are merged into a single column.

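In short, the extract_tables function does not keep column positions consistent across tables, and in some tables it merges columns. How can I fix this so that I end up with one combined table, in a single csv file, containing the columns Ciclo, Graus Dias/dias, Epcaja de Plantion and Regiao de adaptacao?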

Answer 1

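In my experience, this is a data preparation and wrangling problem rather than a parsing problem, because in this case tabulizer's parsing algorithm leaves little room for adjustment apart from switching between methods. From what I could see when I tried to extract the tables, the table on page 7 is not the only one that is parsed incorrectly. Every page is parsed differently, but all of the data seems to be preserved. I can see that your first table has 13 columns, the second 17, the third 12, the fourth 10, and the last three have 11 columns each. What I would suggest is to parse each page separately, clean the data on each page according to the output you need, and then bind the pages together. This is a lengthy process and very specific to each parsed table, so I am only providing a sample script: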

library(dplyr)
library(tidyr)
library(tabulizer)
# Parse each page separately and store one data.frame per page
# (this assumes extract_tables finds a single table on each page)
pages <- 15:21
result <- vector("list", length(pages))
for (k in seq_along(pages)) {
  out <- as.data.frame(extract_tables("mydocument.pdf", pages = pages[k], method = 'stream'), stringsAsFactors = FALSE)
  result[[k]] <- out
}

## ------- DATA CLEANING OPERATIONS examples:
# Remove the top 3 rows of the first page's table, which are not part of the data
result[[1]] <- result[[1]][-(1:3),]
# Perform data cleaning operations such as split/ merge columns according to your liking
# for instance if you want to split column X1 into 4 (as in your original post), you can do that by splitting by whitespace
result[[1]] <- separate(result[[1]], 1, into = c('X1.1','X1.2','X1.3', 'X1.4'),sep = ' ', remove = TRUE)

## ---- After data cleaning operations:
# Bind all data frames into one (by now they should have the same number of columns and matching column names)
df <- bind_rows(result)
# Write your output csv file
write.csv(df, 'yourfilename.csv')
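
To make the clean-then-bind idea concrete without needing the PDF, here is a minimal sketch on toy data. The two data frames stand in for two parsed pages; their values, the helper names `split_column` and `pick_columns`, and the short column name `GrausDias` are all made up for illustration and are not taken from the document. It uses base R only, but the same steps can be done with `tidyr::separate` and `dplyr::bind_rows` as in the script above.

```r
# Toy stand-ins for two parsed pages (hypothetical values, not from the PDF).
# page_a was parsed cleanly; on page_b the first two fields were merged into one column.
page_a <- data.frame(X1 = c("P1", "P2"), X2 = c("Precoce", "Normal"),
                     X3 = c("850", "900"), stringsAsFactors = FALSE)
page_b <- data.frame(X1 = c("P3 Precoce", "P4 Normal"),
                     X2 = c("870", "910"), stringsAsFactors = FALSE)

# Split one merged column on runs of whitespace into a fixed set of new columns.
split_column <- function(df, col, into) {
  parts <- strsplit(trimws(df[[col]]), "\\s+")
  # Fail loudly if any row does not split into the expected number of pieces
  stopifnot(all(lengths(parts) == length(into)))
  pieces <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
  names(pieces) <- into
  cbind(pieces, df[setdiff(names(df), col)])
}

# Keep only the wanted columns and give them the same names on every page.
# `mapping` is a named vector: target name = parsed column name on this page.
pick_columns <- function(df, mapping) {
  out <- df[unname(mapping)]
  names(out) <- names(mapping)
  out
}

page_b <- split_column(page_b, "X1", c("X1.1", "X1.2"))

cleaned <- list(
  pick_columns(page_a, c(Ciclo = "X2", GrausDias = "X3")),
  pick_columns(page_b, c(Ciclo = "X1.2", GrausDias = "X2"))
)

# All pages now share the same column names, so they can be stacked
combined <- do.call(rbind, cleaned)

csv_path <- file.path(tempdir(), "combined.csv")
write.csv(combined, csv_path, row.names = FALSE)
```

The per-page mapping is the part you have to write by hand for each of the seven tables, since the position of each wanted column differs from page to page. Splitting on whitespace only works when the merged values themselves contain no spaces.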

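Also, you may want to look at tabulizer's different parsing methods (I set it to 'stream' here because, in my experience, it usually gives the best results, but 'lattice' may work better for some tables).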
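
One way to choose between the methods is to count how many columns each one produces per page and see which comes closest to the expected 18. The helper below is a rough sketch, not part of tabulizer: `column_counts` and its `parser` argument are names I made up. `parser` defaults to `tabulizer::extract_tables` and exists only so the helper can be tried with a stand-in function when the PDF or Java is not available.

```r
# Count the columns of every table found on each page, once per parsing method.
# Returns a list with one element per method; each element holds one integer
# vector per page (one column count per table detected on that page).
column_counts <- function(file, pages, methods = c("stream", "lattice"),
                          parser = tabulizer::extract_tables) {
  counts <- lapply(methods, function(m) {
    lapply(pages, function(p) {
      tabs <- parser(file, pages = p, method = m)
      vapply(tabs, ncol, integer(1))
    })
  })
  names(counts) <- methods
  counts
}

# Usage with the real file (requires tabulizer and a working Java setup):
# str(column_counts("mydocument.pdf", 15:21))
```

A page where one method returns a count close to 18 and the other returns far fewer is a sign that the second method merged columns there, so it can be worth mixing methods across pages.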
