Extracting data from a PDF in R

Time: 2016-08-05 13:10:20

Tags: r pdf extract

Here is the data:

http://drdpat.bih.nic.in/Downloads/Rice-Varieties-1996-2012.pdf

It is a PDF. On page 2 there is a table that I need to extract and store in a data frame. I followed this link to do so:

https://ropensci.org/blog/2016/03/01/pdftools-and-jeroen

library(pdftools)
text <- pdf_text("data.pdf")
dat<-text[2] # this reads the second page 

After this, whatever I try, it does not convert into a table format. I tried this:

dat1 <- matrix(dat, byrow = TRUE,nrow = 12, ncol = 8) # it didn't work

I also tried the scan function:

dat.s <- scan(dat, what = "character", sep = " ", skip = 2) # no use

Can anyone help me with this? Also, I would like to achieve this in R only.

Thanks

1 answer:

Answer 0: (score: 0)

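The structure of the table in the PDF is somewhat messy: some columns overlap one another, and the tabulizer algorithm cannot extract them correctly.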

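I was only able to extract the first 6 columns from page 2; the last two columns (Salient Features, "Recommended for cultivation") are still problematic...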
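As an aside on the attempts in the question: pdf_text() returns one long string per page, so matrix() only recycles that single string, and scan() treats its first argument as a file name (in-memory strings go in its text= argument). A minimal sketch of splitting such a string by hand is below. The page string is made up for illustration and is not the real PDF content, and the approach assumes cells are separated by runs of two or more spaces and that every row has the same number of cells, which the real table probably violates.

```r
# Made-up stand-in for one element of pdf_text(); NOT the real page contents
page <- "Sl.  Name      Year\n1    Alpha     1996\n2    Beta      2001\n"

# One string per page, so first split it into lines and drop empty ones
lines <- strsplit(page, "\n", fixed = TRUE)[[1]]
lines <- trimws(lines[nzchar(lines)])

# Then split each line on runs of two or more spaces
cells <- strsplit(lines, " {2,}")

# Bind the data rows into a character matrix; the first line is the header
mat <- do.call(rbind, cells[-1])
df <- as.data.frame(mat, stringsAsFactors = FALSE)
names(df) <- cells[[1]]
print(df)
```

With multi-line cells and overlapping columns this simple split breaks down, which is why the code that follows uses tabulizer with explicit areas instead.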

library(tabulizer)
library(dplyr)

out1 <- extract_tables("Rice-Varieties-1996-2012.pdf", pages=2)[[1]]

## With a moderate amount of hacking,
## the following columns are correctly extracted:
## 1. Sl. No.
## 4. Year of Notification
## 5. Duration (in days)
## 6. Eco-System

sel <- gsub(" ","",out1[ ,c(1,4,5,6)])

## To extract Parentage column, you can use the `area` parameter:
## I figured out the values by trial and error
out2 <- extract_tables("Rice-Varieties-1996-2012.pdf", guess=FALSE,
                       pages=2,
                       area=list(c(80,120,2000,420) ) )[[1]]
sel <- cbind(sel,out2[1:nrow(sel),1])

## The header is contained in the first 3 rows of `sel`
## which can be aggregated by `paste0`
print(sel)
head <- aggregate(sel[1:3, ], by=list(rep(1,3)), paste0, collapse="") %>%
    select(-Group.1)

## The body is a bit harder, because each record might be split across
## a variable number of rows, depending on the entries.
## I have used non-empty records for column 1 (Sl.No.)
## to identify the breakpoints where to split sel into row blocks
## pertaining to the same record.
body <- sel[-(1:3), ]
brks <- body[ ,1]!=""
ibrk <- c((1:nrow(body))[brks], nrow(body)+1)
ll <- unlist(sapply(1:(length(ibrk)-1), function(k) rep(ibrk[k],ibrk[k+1]-ibrk[k])))

stopifnot(length(ll)==nrow(body))

body <- data.frame(body, stringsAsFactors=FALSE)
colnames(body) <- head

tab <- aggregate(body, by=list(ll), paste0, collapse="") %>%
    select(-Group.1)

print(tab)

## Using the same trick as above with brks and ibrk,
## one is able to extract column "Name of variety"
## (again, I found the values of area by trial and error).
out3 <- extract_tables("Rice-Varieties-1996-2012.pdf", guess=FALSE,
                       pages=2,
                       area=list(c(80,20,2000,130) ) )[[1]]
sel3 <- gsub(" ","",out3)
head3 <- aggregate(sel3[1:2, ], by=list(rep(1,2)), paste0, collapse="") %>%
    select(-Group.1)
body3 <- sel3[-(1:2), ]
brks3 <- body3[ ,1]!=""
ibrk3 <- c((1:nrow(body3))[brks3], nrow(body3)+1)
ll3 <- unlist(sapply(1:(length(ibrk3)-1), function(k) rep(ibrk3[k],ibrk3[k+1]-ibrk3[k])))

stopifnot(length(ll3)==nrow(body3))
body3 <- data.frame(body3, stringsAsFactors=FALSE)
colnames(body3) <- head3

tab3 <- aggregate(body3, by=list(ll3), paste0, collapse="") %>%
    select(-Group.1)

print(tab3)

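## I have not managed to find a value of `area` that correctly splits
## the last two columns *and* allows the rows of each record to be identified...

## Merge the "Name of variety" column into the main table
## (left_join matches on the column names the two tables share)
tab <- tab %>% left_join(tab3)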
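
The breakpoint trick used above (a new record starts wherever the Sl. No. column is non-empty) may be easier to follow on a toy example. The sketch below uses a made-up two-column matrix, not data from the PDF, and swaps the rep()/sapply() construction for cumsum() over the logical flags, which should yield an equivalent grouping as long as the first row starts a record.

```r
# Made-up matrix: a record starts wherever column 1 is non-empty
body <- matrix(c("1", "IR",
                 "",  "64",
                 "2", "Pusa",
                 "",  "Bas",
                 "",  "mati"),
               ncol = 2, byrow = TRUE)

# TRUE on the first row of each record
brks <- body[, 1] != ""

# Running count of record starts gives one group id per row
grp <- cumsum(brks)

# Paste the fragments of each record together, column by column
df <- data.frame(body, stringsAsFactors = FALSE)
tab <- aggregate(df, by = list(grp), FUN = paste0, collapse = "")
tab$Group.1 <- NULL
print(tab)
```

Here the three fragments of the second record ("Pusa", "Bas", "mati") are collapsed into a single cell, which is the same operation the answer applies to the extracted tabulizer matrices.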