我使用R来提取文本。下面的代码很适合从pdf中提取非粗体文本,但它忽略了粗体部分。有没有办法提取粗体和非粗体文本?
news <-'http://www.frbe-kbsb.be/sites/manager/ICN/14-15/ind01.pdf'
library(pdftools)
library(tesseract)
library(tiff)
info <- pdf_info(news)
numberOfPageInPdf <- as.numeric(info[2])
numberOfPageInPdf
for (i in 1:numberOfPageInPdf){
bitmap <- pdf_render_page(news, page=i, dpi = 300, numeric = TRUE)
file_name <- paste0("page", i, ".tiff")
file_tiff <- tiff::writeTIFF(bitmap, file_name)
out <- ocr(file_name)
file_txt <- paste0("text", i, ".txt")
writeLines(out, file_txt)
}
答案 0 :(得分:2)
我喜欢使用tabulizer
库。这是一个小例子:
devtools::install_github("ropensci/tabulizer")
library(tabulizer)
news <-'http://www.frbe-kbsb.be/sites/manager/ICN/14-15/ind01.pdf'
# note that you need to specify UTF-8 as the encoding, otherwise your special characters
# won't come in correctly
page1 <- extract_tables(news, guess=TRUE, page = 1, encoding='UTF-8')
page1[[1]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] "" "Division: 1" "" "" "" "" "Série: A"
[2,] "" "514" "" "Fontaine 1 KBSK 1" "" "" "303"
[3,] "1" "62529 WIRIG ANTHONY" "" "2501 1⁄2-1⁄2" "51560" "CZEBE ATTILLA" "2439"
[4,] "2" "62359 BRUNNER NICOLAS" "" "2443 0-1" "51861" "PICEU TOM" "2401"
[5,] "3" "75655 CEKRO EKREM" "" "2393 0-1" "10391" "GEIRNAERT STEVEN" "2400"
[6,] "4" "50211 MARECHAL ANDY" "" "2355 0-1" "35181" "LEENHOUTS KOEN" "2388"
[7,] "5" "73059 CLAESEN PIETER" "" "2327 1⁄2-1⁄2" "25615" "DECOSTER FREDERIC" "2373"
[8,] "6" "63614 HOURIEZ CLEMENT" "" "2304 1⁄2-1⁄2" "44954" "MAENHOUT THIBAUT" "2372"
[9,] "7" "60369 CAPONE NICOLA" "" "2283 1⁄2-1⁄2" "10430" "VERLINDE TIEME" "2271"
[10,] "8" "70653 LE QUANG KIM" "" "2282 0-1" "44636" "GRYSON WOUTER" "2269"
[11,] "" "" "< 2361 >" "12 - 20" "" "< 2364 >" ""
如果您只关心某些表,也可以使用locate_areas
函数指定特定区域。请注意,要使locate_areas
起作用,我必须先在本地下载该文件;使用URL返回错误。
您将注意到每个表都是返回列表中的自己的元素。
以下是使用自定义区域选择每个页面上的第一个表格的示例:
customArea <- extract_tables(news, guess=FALSE, page = 1, area=list(c(84,27,232,569), encoding = 'UTF-8')
这也是比使用OCR(光学字符识别)库tesseract
更直接的方法,因为您不依赖于OCR库将像素排列转换回文本。在数字PDF中,每个文本元素都有一个x和y位置,tabulizer
库使用该信息来检测表启发式并提取合理格式的数据。你会发现你还有一些需要做的事,但它很容易管理。
编辑:只是为了好玩,这里有一个用data.table
开始清理的小例子
library(data.table)
cleanUp <- setDT(as.data.frame(page1[[1]]))
cleanUp[ , `:=` (Division = as.numeric(gsub("^.*(\\d+{1,2}).*", "\\1", grep('Division', cleanUp$V2, value=TRUE))),
Series = as.character(gsub(".*:\\s(\\w).*","\\1", grep('Série:', cleanUp$V7, value=TRUE))))
][,ID := tstrsplit(V2," ", fixed=TRUE, keep = 1)
][, c("V1", "V3") := NULL
][-grep('Division', V2, fixed=TRUE)]
我们已将Division
,Series
和ID
移至自己的列中,并删除了Division
标题行。这只是一般性的想法,需要稍微改进才能适用于所有27页。
V2 V4 V5 V6 V7 Division Series ID
1: 514 Fontaine 1 KBSK 1 303 1 A 514
2: 62529 WIRIG ANTHONY 2501 1/2-1/2 51560 CZEBE ATTILLA 2439 1 A 62529
3: 62359 BRUNNER NICOLAS 2443 0-1 51861 PICEU TOM 2401 1 A 62359
4: 75655 CEKRO EKREM 2393 0-1 10391 GEIRNAERT STEVEN 2400 1 A 75655
5: 50211 MARECHAL ANDY 2355 0-1 35181 LEENHOUTS KOEN 2388 1 A 50211
6: 73059 CLAESEN PIETER 2327 1/2-1/2 25615 DECOSTER FREDERIC 2373 1 A 73059
7: 63614 HOURIEZ CLEMENT 2304 1/2-1/2 44954 MAENHOUT THIBAUT 2372 1 A 63614
8: 60369 CAPONE NICOLA 2283 1/2-1/2 10430 VERLINDE TIEME 2271 1 A 60369
9: 70653 LE QUANG KIM 2282 0-1 44636 GRYSON WOUTER 2269 1 A 70653
10: 12 - 20 < 2364 > 1 A NA
答案 1 :(得分:1)
无需浏览PDF - &gt; TIFF - &gt; OCR循环,因为summarise()
可以直接读取此文件:
pdftools::pdf_text()