I need to extract tables from a PDF. Here is the link:
https://ainfo.cnptia.embrapa.br/digital/bitstream/item/155505/1/doc-202-1.pdf
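I want to extract the tables on pages 15 to 21. All of these tables have the same structure (18 columns) and the same header. Here is a snapshot of a single table.
In each table, I am only interested in columns 6-8 and 17: `Ciclo`, `Graus Dias/dias`, `Epcaja de Plantion`, and `Regiao de adaptacao`.
This is what I did: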
library(dplyr)
library(tabulizer)
out <- extract_tables("mydocument.pdf", pages = 15:21)
# this gives me a list of 7 tables.
temp <- data.frame(out[[1]]) # taking the first table as an example
# these are the columns corresponding to `Ciclo`, `Graus Dias/dias`, `Epcaja de Plantion` and `Regiao de adaptacao`
temp %>% dplyr::select(X3, X4, X5, X12)
# this is a snapshot of first table
However, when I extract the 7th table:
temp <- data.frame(out[[7]])
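# Columns 1:4 are merged into a single column.
In short, the `extract_tables` function does not keep column positions consistent across tables, and in some tables it merges columns. How can I fix this so that I end up with a single CSV file containing one combined table with the `Ciclo`, `Graus Dias/dias`, `Epcaja de Plantion`, and `Regiao de adaptacao` columns?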
Answer 0 (score: 0)
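In my experience this is a data preparation and wrangling problem rather than a parsing problem, because tabulizer's parsing algorithm leaves little room for adjustment here beyond switching between methods. From what I can see when I try to extract the tables, page 7 is not the only table that is parsed incorrectly. Every page is parsed differently, but all of the data appears to be retained. I can see that your first table has 13 columns, the following ones have 17, 3, 12, 4, and 10, and the last three have 11 columns. What I would suggest is to parse each page separately, clean each one according to the output you need from it, and then bind them together. This is a lengthy process and very specific to each parsed table, so I only provide an example script: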
library(dplyr)
library(tidyr)
library(tabulizer)
# Parse each page separately and collect one data.frame per page.
# lapply() returns a list of exactly 7 elements, so there are no
# empty leading entries to remove afterwards.
result <- lapply(15:21, function(i) {
  tabs <- extract_tables("mydocument.pdf", pages = i, method = "stream")
  # Assumes one table per page; keep the first matrix returned
  data.frame(tabs[[1]], stringsAsFactors = FALSE)
})
## ------- DATA CLEANING OPERATIONS examples:
# Remove top 3x lines from the first page of table1 not part of data
result[[1]] <- result[[1]][-(1:3),]
# Perform data cleaning operations such as split/ merge columns according to your liking
# for instance if you want to split column X1 into 4 (as in your original post), you can do that by splitting by whitespace
result[[1]] <- separate(result[[1]], 1, into = c("X1.1", "X1.2", "X1.3", "X1.4"), sep = " ", remove = TRUE)
## ---- After data cleaning operations:
# Bind all data frames into one (by now they should have the same
# number of columns and matching column names)
df <- bind_rows(result)
# Write your output csv file
write.csv(df, 'yourfilename.csv')
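To make the clean-then-bind step more concrete, here is a minimal sketch on mock data rather than on the real PDF. The two data frames, their values, and the short column names in `wanted` are all invented for illustration; the real parsed pages will need their own page-specific cleaning.

```r
library(dplyr)
library(tidyr)

# Mock page 1: the four fields were parsed into separate columns
# (values are invented, not taken from the PDF)
page1 <- data.frame(X1 = c("Precoce", "Normal"),
                    X2 = c("850", "900"),
                    X3 = c("S/C", "S"),
                    X4 = c("SE/CO", "S"),
                    stringsAsFactors = FALSE)

# Mock page 2: the same four fields were merged into one column,
# separated by whitespace
page2 <- data.frame(X1 = c("Precoce 830 S SE", "Normal 910 S/C CO"),
                    stringsAsFactors = FALSE)

# Hypothetical short names for the four columns of interest
wanted <- c("ciclo", "graus_dias", "plantio", "regiao")

# Page 1 only needs renaming; page 2 needs the merged column split
names(page1) <- wanted
page2 <- separate(page2, X1, into = wanted, sep = "\\s+")

# Both pages now share column names and types, so they bind cleanly
combined <- bind_rows(page1, page2)
print(combined)

# Write the combined table without the row-number column
out_file <- tempfile(fileext = ".csv")
write.csv(combined, out_file, row.names = FALSE)
```

Splitting on `"\\s+"` instead of a single space tolerates runs of spaces, but it only works if no individual field contains whitespace itself.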
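Also, you may want to look at tabulizer's different parsing methods. I set it to "stream" here because, in my experience, it usually gives the best results, but "lattice" may work better for some tables.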
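One rough way to compare methods is to check how consistent the column counts are across the tables each one returns. The sketch below uses mock matrices in place of real `extract_tables()` output, so the widths shown (including the uniform result for "lattice") are invented and say nothing about how either method behaves on this PDF. `col_counts` is a hypothetical helper, not part of tabulizer.

```r
# Hypothetical helper: number of columns in each parsed table
col_counts <- function(tables) vapply(tables, ncol, integer(1))

# Mock stand-ins for the lists of matrices that a call such as
#   extract_tables("mydocument.pdf", pages = 15:21, method = "stream")
# would return; the shapes are invented for illustration
stream_out  <- list(matrix("", 5, 13), matrix("", 5, 17), matrix("", 5, 11))
lattice_out <- list(matrix("", 5, 18), matrix("", 5, 18), matrix("", 5, 18))

counts <- list(stream  = col_counts(stream_out),
               lattice = col_counts(lattice_out))
print(counts)

# Treat a method as consistent if every table has the same width
consistent <- vapply(counts, function(x) length(unique(x)) == 1, logical(1))
print(consistent)
```

A uniform width does not guarantee the columns are correct, so it is still worth inspecting a few rows from each page before binding.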