Question

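Any suggestions are welcome. This is time sensitive. My PDF reports are mostly blocks of text. They are long reports (roughly 50-100 pages). I am trying to write an R script that can extract specific sections of these PDF reports using start/stop position strings. Note: the reports vary in length. A short example: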

DOCUMENT TITLE
01. SECTION 1
This is a test section that I DONT want to extract. 
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text

02. SECTION 2
This is a test section that I do want to extract. 
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
...

11. SECTION 11
This is a test section that I do want to extract. 
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
...

12. SECTION 12
This is a test section that I DONT want to extract. 
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
...

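So the goal in this example is to extract the paragraphs under Section 2 and store them as a field/data point. I would also like to store Section 11 as a field/data point. Please note that the documents are in PDF format.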

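I have tried pdftools, tm, and stringr, and I have spent more than 20 hours searching for solutions and tutorials on how to do this. I know it is possible because I have done it before with SAS...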

Please see my code below; I have added my questions as comments. I believe regex will be part of the solution, but I am lost.

# Init Step
libs <- c("tm","class","stringr","testthat",
          "pdftools")
lapply(libs, require, character.only= TRUE)


# File name & location
filename = "~/pdf_test/test.pdf"

# converting PDF to text
textFile <- pdf_text(filename) 

cat(textFile[1]) # Text of pg. 1 of PDF
cat(textFile[2]) # Text of pg. 2 of PDF

# I'm at a loss as to how to parse the values I want. I have seen things like:
#
#   sectionxyz <- str_extract_all(textFile, #??? )
#   rm_between()

# 1) How do I loop through each page of PDF file?
# 2) How do I identify start/stop positions for section to be extracted?
# 3) How do I add logic to extract text between start/stop positions
#    and then add the result to a data field?
# 4) Sections in PDF will be long sections of text (i.e. 100+ words into a field)

NEW ------

So I have been able to prep the document correctly and identify the correct start/stop patterns:

length(grep("^11\\. LIMITS OF LIABILITY( +){1}$",source_main2))
length(grep("Applicable\\s+[Ll]imits\\s+[Oo]f",source_main2))

pat_st_lol <- "^11\\. LIMITS OF LIABILITY( +){1}$"
pat_ed_lol <- "Applicable\\s+[Ll]imits\\s+[Oo]f"

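The length(grep()) statements only verify that exactly 1 instance is found. From here, I am lost on how to use gsub or something similar to extract the section of data I want. I have tried: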

pat <- paste0(".*",pat_st_lol,"(.*)",pat_ed_lol,".*")
test <- gsub(".*^11\\. LIMITS OF LIABILITY( +){1}$(.*)\n", "Applicable\\s+[Ll]imits\\s+[Oo]f", source_main2)
test2 <-gsub(".*pat_st_lol(.*)\npat_ed_lol.*")

Little progress so far, but progress nonetheless.

Answer 1

If you can systematically identify the sections you need, you can use a regular expression to extract the text you want.

In the example above, something like gsub(".*SECTION 11(.*)\n12\\..*","\\1",string) should work.
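As a minimal sketch of that call, the snippet below runs it on a short stand-in string. The sample text is made up to mirror the example in the question, and trimws() is an extra step added here to drop the newline left after the heading. With a real report, string would be the PDF text joined into one string.

```r
# Stand-in text mirroring the example report in the question (hypothetical content).
string <- paste(
  "DOCUMENT TITLE",
  "10. SECTION 10",
  "Text I do not want.",
  "11. SECTION 11",
  "This is a test section that I do want to extract.",
  "Sample Text Text Text",
  "12. SECTION 12",
  "Text I do not want.",
  sep = "\n"
)

# Keep only what sits between "SECTION 11" and the newline before "12.".
section11 <- gsub(".*SECTION 11(.*)\n12\\..*", "\\1", string)

# The capture starts with the newline that follows the heading; trim it.
section11 <- trimws(section11)
cat(section11, "\n")
```

Because both .* parts are greedy, this form assumes the start and end markers each occur once in the text.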

You can now define the pattern dynamically with paste and iterate over all files. Each result can then be saved in a data.frame, a list, and so on.
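One way this could look is sketched below. The helper extract_between() and the two stand-in documents are hypothetical and exist only for illustration. With real files, each element of docs would presumably be built with something like paste(pdftools::pdf_text(path), collapse = "\n"), so that all pages form one string.

```r
# Hypothetical helper: text between two regex markers, or NA when the
# markers are not found (gsub alone would return the input unchanged).
extract_between <- function(text, first, last) {
  pat <- paste0(".*", first, "(.*)\n", last, ".*")
  if (!grepl(pat, text)) return(NA_character_)
  trimws(gsub(pat, "\\1", text))
}

# Stand-ins for the text of two reports (made-up content).
docs <- c(
  report_a = paste("01. SECTION 1", "skip", "02. SECTION 2", "keep A2",
                   "03. SECTION 3", "skip", "11. SECTION 11", "keep A11",
                   "12. SECTION 12", "skip", sep = "\n"),
  report_b = paste("02. SECTION 2", "keep B2", "03. SECTION 3", "skip",
                   sep = "\n")
)

# One row per report, one column per extracted section.
results <- data.frame(
  file       = names(docs),
  section_02 = unname(sapply(docs, extract_between,
                             first = "SECTION 2", last = "03\\.")),
  section_11 = unname(sapply(docs, extract_between,
                             first = "SECTION 11", last = "12\\.")),
  stringsAsFactors = FALSE
)
print(results)
```

Returning NA for a missing section keeps a whole report from landing in a field when a pattern fails to match.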

Here is a short, more detailed explanation of the pattern:

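First, .* is a way to match "anything". If you want to match digits, you can use \\d or the equivalent [0-9]. A short introduction to regex in R (which I found very useful) lists several more character classes.
.* at the edges of the pattern means that anything may come before and after the part we want.
(.*) represents what we want (so anything, since it is matched with .* here). Basically it means: extract "anything" between SECTION 11 and 12.
\\. means a literal dot, and \n is the "newline" metacharacter (since a new line begins just before "12.").
In regex you can use parentheses to create groups in a pattern, e.g. gsub(".*(\\d{2}\\:\\d{2})", "\\1","18.05.2018, 21:37") returns 21:37, and gsub("([A-Za-z]) \\d+","\\1","hello 123") gives hello.
The second argument of gsub can be, and usually is, used to supply a replacement, i.e. something that replaces the matched pattern. Here, however, we do not want to substitute anything; we want to extract something. \\1 means "extract the first group", i.e. whatever is inside the first pair of parentheses (you can have several groups).
Finally, string is the string we want to extract from, i.e. the text of the PDF file.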
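One detail the pattern relies on is that .* can run across line breaks. The sketch below, on a made-up string, compares R's default regex engine with perl = TRUE. The variable names are only illustrative.

```r
x <- "SECTION 11\nline one\nline two\n12. SECTION 12"
pat <- ".*SECTION 11\n(.*)\n12\\..*"

# Default engine (TRE): "." also matches "\n", so the capture spans lines.
tre_result <- gsub(pat, "\\1", x)

# perl = TRUE: "." stops at "\n", the pattern finds no match,
# and gsub returns x unchanged.
pcre_result <- gsub(pat, "\\1", x, perl = TRUE)

# perl = TRUE with the (?s) flag: "." matches "\n" again.
pcre_dotall <- gsub(paste0("(?s)", pat), "\\1", x, perl = TRUE)

cat(tre_result, "\n")   # the two captured lines
print(identical(pcre_result, x))
```

The same caution applies to stringr, where the dot does not match a newline unless the pattern is wrapped in regex(..., dotall = TRUE).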

Now, if you want to do something like this in a loop, you can do the following:

# we are in the loop
# first is your starting point in the extraction, i.e. "SECTION 11"
# last is your end point, i.e. "12."

first <- "SECTION 11"   # first and last can be dynamically assigned 
last <- "12\\."         # "\\" is added before the dot as "." is a Regex metachar
# If last doesn't systematically contain a dot
# you could use gsub to add "\\" before the dot when needed:
# gsub("\\.","\\\\.",".") returns "\\."
# so gsub("\\.","\\\\.","12.") returns "12\\."

# "\n" is added to stop before the newline, but it could be omitted
# (then "\n" might appear in the extraction)
pat <- paste0(".*",first,"(.*)","\n",last,".*")
gsub(pat,"\\1",string) # returns the same as above
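A self-contained variant of the loop body is sketched below, assuming the start/end markers are given as literal text. The sample string, the sections list and the escape_dot() helper are all hypothetical. It uses regexec() with regmatches() instead of gsub(), so that a section that is not found yields NA rather than the whole input.

```r
# Stand-in for the text of one report (made-up content).
string <- paste(
  "01. SECTION 1", "skip",
  "02. SECTION 2", "keep 2",
  "03. SECTION 3", "skip",
  "11. SECTION 11", "keep 11",
  "12. SECTION 12", "skip",
  sep = "\n"
)

# Hypothetical start/end markers, written as literal text rather than regex.
sections <- list(
  section_02 = c(first = "SECTION 2",  last = "03."),
  section_11 = c(first = "SECTION 11", last = "12."),
  section_99 = c(first = "SECTION 99", last = "100.")  # not in the text
)

# Escape dots so they match literally (the gsub call from the comments above).
escape_dot <- function(x) gsub("\\.", "\\\\.", x)

extracted <- vapply(sections, function(s) {
  pat <- paste0(escape_dot(s[["first"]]), "(.*)\n", escape_dot(s[["last"]]))
  # regmatches() gives c(full match, group 1), or character(0) if no match.
  m <- regmatches(string, regexec(pat, string))[[1]]
  if (length(m) < 2) NA_character_ else trimws(m[2])
}, character(1))

print(extracted)
```

The result is a named character vector, which can be stored as one row of a data.frame per report.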

How to extract specific sections of text from PDF files into an R data frame?
