How do I extract specific sections of text from a PDF file into an R data frame? (Complex)

Time: 2018-05-18 17:09:33

Tags: r regex text-extraction data-extraction

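Any suggestions are welcome. This is time sensitive. My PDF reports are mostly blocks of text. They are long reports (about 50-100 pages). I am trying to write an R script that can extract specific sections of these PDF reports using start/stop position strings. Note: the reports vary in length. A short example: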

DOCUMENT TITLE
01. SECTION 1
This is a test section that I DONT want to extract. 
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text

02. SECTION 2
This is a test section that I do want to extract. 
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
...

11. SECTION 11
This is a test section that I do want to extract. 
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
...

12. SECTION 12
This is a test section that I DONT want to extract. 
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
...

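So the goal in this example is to extract the paragraphs under Section 2 and store them as a field/data point. I would also like to store Section 11 as a field/data point. Please note that the documents are in PDF format.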

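I have tried using pdftools, tm, and stringr, and I have spent 20+ hours searching for solutions and tutorials on how to do this. I know it is possible, because I have done it before using SAS...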

Please see my code below; I have added my questions as comments. I believe regex will be part of the solution, but I am lost.

# Init Step
libs <- c("tm","class","stringr","testthat",
          "pdftools")
lapply(libs, require, character.only= TRUE)


# File name & location
filename = "~/pdf_test/test.pdf"

# converting PDF to text
textFile <- pdf_text(filename) 

cat(textFile[1]) # Text of pg. 1 of PDF
cat(textFile[2]) # Text of pg. 2 of PDF

# I'm at a loss as to how to parse the values I want. I have seen things like:
#
#   sectionxyz <- str_extract_all(textFile, ???)
#   rm_between()

# 1) How do I loop through each page of PDF file?
# 2) How do I identify start/stop positions for section to be extracted?
# 3) How do I add logic to extract text between start/stop positions
#    and then add the result to a data field?
# 4) Sections in PDF will be long sections of text (i.e. 100+ words into a field)

NEW ------

So I was able to:
- Prep the doc correctly
- Identify the correct start/stop patterns:

length(grep("^11\\. LIMITS OF LIABILITY( +){1}$",source_main2))
length(grep("Applicable\\s+[Ll]imits\\s+[Oo]f",source_main2))

pat_st_lol <- "^11\\. LIMITS OF LIABILITY( +){1}$"
pat_ed_lol <- "Applicable\\s+[Ll]imits\\s+[Oo]f"

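The length(grep()) statements just verify that exactly 1 instance is found. From here, I am lost on how to use gsub or something similar to extract the section of data I want. I have tried: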

pat <- paste0(".*",pat_st_lol,"(.*)",pat_ed_lol,".*")
test <- gsub(".*^11\\. LIMITS OF LIABILITY( +){1}$(.*)\n", "Applicable\\s+[Ll]imits\\s+[Oo]f", source_main2)
test2 <-gsub(".*pat_st_lol(.*)\npat_ed_lol.*")

Little progress so far, but progress nonetheless.

1 Answer:

Answer 0: (score: 0)

If you can systematically identify the sections you need, you can use regular expressions to extract the desired text, as you suggested.

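In the example above, something like gsub(".*SECTION 11(.*)\n12\\..*","\\1",string) should work. This relies on R's default regex engine, in which . also matches newlines; with perl = TRUE you would need to prefix the pattern with (?s).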
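
To see what that call returns, here is a minimal sketch run on a made-up string laid out like the example in the question. The sample text and the variable name section11 are assumptions for illustration; a real report would come from pdftools::pdf_text() with its pages pasted into one string.

```r
# Sample text mirroring the question's layout (made up for illustration)
string <- paste(
  "DOCUMENT TITLE",
  "10. SECTION 10",
  "Text I do not want.",
  "",
  "11. SECTION 11",
  "This is a test section that I do want to extract.",
  "Sample Text Text Text",
  "",
  "12. SECTION 12",
  "Text I do not want either.",
  sep = "\n"
)

# Keep only what sits between "SECTION 11" and the line starting with "12."
section11 <- gsub(".*SECTION 11(.*)\n12\\..*", "\\1", string)

# The capture still carries the newlines around the section body, so trim them
section11 <- trimws(section11)
cat(section11, "\n")
```

The captured group includes the line breaks that surround the section body, which is why trimws() is applied afterwards.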

Now you can use paste to define the pattern dynamically and iterate over all files. Each result can then be stored in a data.frame, a list, ...

Here is a slightly more detailed explanation of the pattern:

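  • First, .* is a way of matching "anything". If you want to match digits, you can use \\d or the equivalent [0-9]. Here is a short introduction to regex in R (I found it very useful), where you can find several character classes.
  • .* at the edges of the pattern means there may be text before/after it.
  • (.*) denotes what we want (so here it matches anything, using .*). Basically, it means: extract "anything" between SECTION 11 and 12.
  • \\. denotes a literal dot, and \n is the "newline" metacharacter (because a new line begins before "12.").
  • In regex, you can use parentheses to create groups in a pattern, e.g. gsub(".*(\\d{2}\\:\\d{2})", "\\1","18.05.2018, 21:37") returns 21:37, and gsub("([A-z]) \\d+","\\1","hello 123") gives hello.
  • The second argument of gsub can be, and usually is, used to supply a replacement, i.e. something that replaces the matched pattern. Here, however, we do not want a replacement; we want to extract something. \\1 means "extract the first group", i.e. whatever is inside the first pair of parentheses (you can have several groups).
  • Finally, string is the string we want to extract from, i.e. the text of the PDF file.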
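
Because gsub replaces rather than extracts, it returns its input unchanged when the pattern does not match, which is easy to miss inside a loop. As a hedged alternative (not part of the original answer), base R's regexec and regmatches return the match and its groups directly. The variable names below are illustrative only.

```r
x <- "18.05.2018, 21:37"

# gsub replaces the whole match with group 1, leaving only the time
time_gsub <- gsub(".*(\\d{2}:\\d{2})", "\\1", x)

# regexec + regmatches return the full match followed by each group
parts <- regmatches(x, regexec("(\\d{2}):(\\d{2})$", x))[[1]]

# When nothing matches, gsub hands back the input untouched ...
no_match_gsub <- gsub(".*(\\d{2}:\\d{2})", "\\1", "no time here")

# ... whereas regmatches yields an empty character vector
no_match_parts <- regmatches("no time here",
                             regexec("(\\d{2}):(\\d{2})$", "no time here"))[[1]]

print(time_gsub)
print(parts)
print(no_match_gsub)
print(length(no_match_parts))
```

Checking for the empty result (or calling grepl first) makes a missing section visible instead of silently storing the whole document.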

Now, if you want to do something similar inside a loop, you could do the following:

# we are in the loop
# first is your starting point in the extraction, i.e. "SECTION 11"
# last is your end point, i.e. "12."

first <- "SECTION 11"   # first and last can be dynamically assigned 
last <- "12\\."         # "\\" is added before the dot as "." is a Regex metachar
# If last doesn't systematically contain a dot
# you could use gsub to add "\\" before the dot when needed:
# gsub("\\.","\\\\.",".") returns "\\."
# so gsub("\\.","\\\\.","12.") returns "12\\."

pat <- paste0(".*",first,"(.*)","\n",last,".*")  #"\n" is added to stop before the newline, but it could be omitted (then "\n" might appear in the extraction)
gsub(pat,"\\1",string) # returns the same as above
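
Putting the loop together, here is a hedged sketch that collects Sections 2 and 11 from several reports into a data frame. The helper extract_between, the reports list, and its contents are hypothetical stand-ins: in practice each element would be the character vector returned by pdftools::pdf_text() for one file. Pages are pasted into one string first, since a section may run across a page break.

```r
# Hypothetical helper (not from the original answer): returns the text between
# `first` and a line starting with `last`, or NA when the pattern is not found.
extract_between <- function(text, first, last) {
  pat <- paste0(".*", first, "(.*)\n", last, ".*")
  if (!grepl(pat, text)) return(NA_character_)  # gsub would return `text` as-is
  trimws(gsub(pat, "\\1", text))
}

# Stand-in for lapply(files, pdftools::pdf_text): one character vector per
# report, one element per page. The contents are made up for illustration.
reports <- list(
  report_a = c("TITLE\n01. SECTION 1\nskip\n\n02. SECTION 2\nkeep A2\n",
               "\n03. SECTION 3\nskip\n\n11. SECTION 11\nkeep A11\n\n12. SECTION 12\nskip"),
  report_b = c("TITLE\n02. SECTION 2\nkeep B2\n\n03. SECTION 3\nskip")
)

rows <- lapply(names(reports), function(id) {
  # Collapse the pages so a section spanning a page break stays in one string
  text <- paste(reports[[id]], collapse = "\n")
  data.frame(
    report    = id,
    section2  = extract_between(text, "SECTION 2",  "03\\."),
    section11 = extract_between(text, "SECTION 11", "12\\."),
    stringsAsFactors = FALSE
  )
})

# One row per report, one column per extracted section
result <- do.call(rbind, rows)
print(result)
```

Both first and last are treated as regular expressions here, so any metacharacters in them need escaping as described in the comments above. A report that lacks a section gets NA in that column rather than the full document text.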