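Any suggestions are welcome. This is time-sensitive. My PDF reports are mostly blocks of text, and they are long (about 50-100 pages). I am trying to write an R script that extracts specific sections of these PDF reports using start/stop position strings. Note: the reports vary in length. A short example: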
DOCUMENT TITLE
01. SECTION 1
This is a test section that I DONT want to extract.
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
02. SECTION 2
This is a test section that I do want to extract.
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
...
11. SECTION 11
This is a test section that I do want to extract.
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
...
12. SECTION 12
This is a test section that I DONT want to extract.
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
...
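So the goal in this example is to extract the paragraphs under Section 2 and store them as a field/data point. I would also like to store Section 11 as a field/data point. Please note that the documents are in PDF format.
I have tried pdftools, tm and stringr, and I have spent more than 20 hours searching for solutions and tutorials on how to do this. I know it is possible, because I have done it before with SAS...
Please see my code below; I have added my questions as comments. I believe regex will be part of the solution, but I am lost.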
# Init Step
libs <- c("tm","class","stringr","testthat",
"pdftools")
lapply(libs, require, character.only= TRUE)
# File name & location
filename = "~/pdf_test/test.pdf"
# converting PDF to text
textFile <- pdf_text(filename)
cat(textFile[1]) # Text of pg. 1 of PDF
cat(textFile[2]) # Text of pg. 2 of PDF
# I'm at a loss as to how to parse the values I want. I have seen things like:
# sectionxyz <- str_extract_all(textFile, #??? )
# rm_between()  # from the qdapRegex package
# 1) How do I loop through each page of PDF file?
# 2) How do I identify start/stop positions for section to be extracted?
# 3) How do I add logic to extract text between start/stop positions
# and then add the result to a data field?
# 4) Sections in PDF will be long sections of text (i.e. 100+ words into a field)
NEW ------
So I was able to:
- Prep the document correctly
- Identify the correct start/stop patterns:
length(grep("^11\\. LIMITS OF LIABILITY( +){1}$",source_main2))
length(grep("Applicable\\s+[Ll]imits\\s+[Oo]f",source_main2))
pat_st_lol <- "^11\\. LIMITS OF LIABILITY( +){1}$"
pat_ed_lol <- "Applicable\\s+[Ll]imits\\s+[Oo]f"
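The length(grep()) statements only verify that exactly one instance is found. From here, I am lost as to how to use gsub or something similar to extract the part of the data I want. I have tried: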
pat <- paste0(".*",pat_st_lol,"(.*)",pat_ed_lol,".*")
test <- gsub(".*^11\\. LIMITS OF LIABILITY( +){1}$(.*)\n",
"Applicable\\s+[Ll]imits\\s+[Oo]f", source_main2)
test2 <-gsub(".*pat_st_lol(.*)\npat_ed_lol.*")
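Little progress so far, but progress nonetheless.
Answer 0 (score: 0)
If you can systematically identify the sections you need, you can use a regular expression to extract the text you want, as you suggested.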
In the example above, something like gsub(".*SECTION 11(.*)\n12\\..*","\\1",string) should work.
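As a quick check, the call can be tried on a shortened version of the sample text from the question. This is only a minimal sketch: doc is a made-up stand-in for the PDF text, and it assumes the pages returned by pdf_text() have already been joined into a single string (for example with paste(..., collapse = "\n")), since the pattern needs to see the whole document at once.

```r
# Hypothetical stand-in for the text of one PDF, joined into a single string.
doc <- paste(
  "DOCUMENT TITLE",
  "01. SECTION 1",
  "This is a test section that I DONT want to extract.",
  "11. SECTION 11",
  "This is a test section that I do want to extract.",
  "Sample Text Text Text",
  "12. SECTION 12",
  "This is a test section that I DONT want to extract.",
  sep = "\n"
)

# Keep only what sits between "SECTION 11" and the line starting with "12.".
section11 <- gsub(".*SECTION 11(.*)\n12\\..*", "\\1", doc)

# Drop the newline left over from the end of the heading line.
section11 <- trimws(section11)
cat(section11)
```

Note that this relies on R's default regex engine, where "." also matches newlines; with perl = TRUE the pattern would need a (?s) prefix. If either marker occurs more than once in a report (for example in a table of contents), the match may not be the one you expect, so it is worth checking the result on a real file.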
Now you can use paste to define the pattern dynamically and iterate over all your files. Each result can then be saved in a data.frame, a list, and so on.
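One way this could look is sketched here. The helper name extract_section, the docs vector and its file names are all invented for illustration; with real files, each element of docs would come from something like paste(pdftools::pdf_text(path), collapse = "\n").

```r
# Hypothetical helper: build the pattern from a start and an end marker
# (both given as regular expressions) and return the text between them.
extract_section <- function(text, first, last) {
  pat <- paste0(".*", first, "(.*)\n", last, ".*")
  # If the pattern does not match, gsub() returns its input unchanged,
  # so report NA instead of silently returning the whole document.
  if (!grepl(pat, text)) return(NA_character_)
  trimws(gsub(pat, "\\1", text))
}

# Made-up stand-ins for the full text of two reports.
docs <- c(
  report_a.pdf = "01. SECTION 1\nskip\n02. SECTION 2\nkeep A2\n03. SECTION 3\nskip",
  report_b.pdf = "01. SECTION 1\nskip\n02. SECTION 2\nkeep B2\nmore B2\n03. SECTION 3\nskip"
)

# One row per file, one column per extracted section.
results <- data.frame(
  file = names(docs),
  section_2 = vapply(docs, extract_section, character(1),
                     first = "SECTION 2", last = "03\\.", USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
print(results)
```

Returning NA for a report without the markers makes missing sections easy to spot in the resulting data.frame; whether that is the right behaviour depends on how uniform the reports are.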
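Here is a slightly more detailed explanation of the pattern:
- .* is the way to match "anything". If you want to match digits, you can use \\d or the equivalent [0-9]. Here is a short introduction to regex in R (which I found really useful), where you can find several character classes.
- .* at the edges of the pattern matches everything before and after the part we are interested in, while (.*) marks what we want (so here it matches anything, using .*). Basically, it means: extract "anything" between SECTION 11 and 12.
- \\. means a literal dot, and \n is the "newline" metacharacter (because a new line starts before "12.").
- For example, gsub(".*(\\d{2}\\:\\d{2})", "\\1","18.05.2018, 21:37") will return 21:37, and gsub("([A-Za-z]) \\d+","\\1","hello 123") will give hello.
- The second argument of gsub can be, and usually is, used to provide a substitution, i.e. something that replaces the matched pattern. Here, however, we do not want a substitution; we want to extract something. \\1 means "extract the first group", i.e. whatever is inside the first pair of parentheses (you can have several groups).
- string is the string we want to extract from, i.e. the text of the PDF file.
Now, if you want to do something similar in a loop, you can do the following: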
# we are in the loop
# first is your starting point in the extraction, i.e. "SECTION 11"
# last is your end point, i.e. "12."
first <- "SECTION 11" # first and last can be dynamically assigned
last <- "12\\." # "\\" is added before the dot as "." is a Regex metachar
# If last doesn't systematically contain a dot
# you could use gsub to add "\\" before the dot when needed:
# gsub("\\.","\\\\.",".") returns "\\."
# so gsub("\\.","\\\\.","12.") returns "12\\."
# "\n" is added to stop before the newline; it could be omitted
# (then "\n" might appear in the extraction)
pat <- paste0(".*",first,"(.*)","\n",last,".*")
gsub(pat,"\\1",string) # returns the same as above
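If the text is held as a vector of lines rather than one long string, as the grep() calls with ^ and $ in the question suggest, an alternative is to find the start and stop lines by index and slice between them. A minimal sketch, assuming source_main2 is such a character vector: the sample lines and the names start_idx, stop_idx and limits_of_liability are invented for illustration, while the two patterns are the ones from the question.

```r
# Invented sample lines standing in for source_main2 (one element per line).
source_main2 <- c(
  "10. DEFINITIONS ",
  "Some earlier text.",
  "11. LIMITS OF LIABILITY ",
  "First line of the section.",
  "Second line of the section.",
  "Applicable Limits of Insurance",
  "Text after the section."
)

pat_st_lol <- "^11\\. LIMITS OF LIABILITY( +){1}$"
pat_ed_lol <- "Applicable\\s+[Ll]imits\\s+[Oo]f"

# Index of the line matching each marker; stop unless each is found exactly
# once and the end marker comes after the start marker.
start_idx <- grep(pat_st_lol, source_main2)
stop_idx  <- grep(pat_ed_lol, source_main2)
stopifnot(length(start_idx) == 1, length(stop_idx) == 1, stop_idx > start_idx)

# Lines strictly between the two markers, joined into a single field.
section_lines <- source_main2[start_idx + seq_len(stop_idx - start_idx - 1)]
limits_of_liability <- paste(section_lines, collapse = "\n")
cat(limits_of_liability)
```

Working with line indices avoids building one large pattern with .* on both sides, and the explicit checks fail loudly when a report does not contain exactly one start and one stop marker.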