How to split PDF text by sub-section numbering

Asked: 2019-07-11 19:53:13

Tags: python regex pdf

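My question is not about PDF extraction itself. Suppose this is the text extracted from the PDF:

(a) This is my first paragraph, which is some junk text

(b) This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d) somewhere within this text

(c) This again is is some third paragraph

I am trying to build a list of 3 values, one for each paragraph.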

import re
entire_text = """(a) This is my first paragraph, which is some junk text

(b) This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d) somewhere within this text

(c) This again is is some third paragraph"""
PDF_SUB_SECTIONS = ["(a) ", "(b) ", "(c) ", "(d) ", "(e) ", "(f) ", "(g) "]
regexPattern = '|'.join(map(re.escape,PDF_SUB_SECTIONS))
glSubSections = re.split(regexPattern, entire_text)

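What I expect: ["This is my first paragraph, which is some junk text", "This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d) somewhere within this text", "This again is is some third paragraph"]

What I get: ["This is my first paragraph, which is some junk text", "This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945", "somewhere within this text", "This again is is some third paragraph"]

More information: 1) In "clause 945(d)", the "945" (or whatever text comes before it) runs directly into "(d", as in the sample above. 2) I am using PyPDF2 to extract the text above.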

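2 answers:

Answer 0 (score: 1)

There are several ways to do this with regular expressions, but the expression tends to get complicated, and a regex may not be the best approach. For example, an expression similar to: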

^(?:\([^)]+\))\s*(.*)
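To make the pieces of that expression easier to follow, here is the same pattern written with re.VERBOSE so each part can carry a comment. This is only an illustrative sketch: the names PARAGRAPH, sample and paragraphs are made up for the example, and the sample text is shortened from the question.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE (whitespace and # comments are ignored).
PARAGRAPH = re.compile(
    r"""
    ^               # start of a line (requires re.MULTILINE)
    (?:\([^)]+\))   # a parenthesised marker such as (a); non-capturing
    \s*             # whitespace between the marker and the text
    (.*)            # group 1: the rest of the line, i.e. the paragraph text
    """,
    re.VERBOSE | re.MULTILINE,
)

# Shortened, hypothetical sample in the same shape as the question's text.
sample = "(a) first paragraph\n\n(b) mentions clause 945(d) in passing\n\n(c) third paragraph"

paragraphs = PARAGRAPH.findall(sample)
print(paragraphs)
```

Because the marker is only accepted right after ^, the "(d)" inside "945(d)" is left alone. Note that (.*) stops at the end of the line, so this sketch assumes each paragraph sits on a single line.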

Test with re.findall
import re

regex = r"^(?:\([^)]+\))\s*(.*)"

test_str = ("(a) This is my first paragraph, which is some junk text\n\n"
    "(b) This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d)\n\n"
    "(c) This again is is some third paragraph")

print(re.findall(regex, test_str, re.MULTILINE))

Output

['This is my first paragraph, which is some junk text', 'This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d)', 'This again is is some third paragraph']

Test with re.sub
import re

regex = r"^(?:\([^)]+\))\s*(.*)"

test_str = ("(a) This is my first paragraph, which is some junk text\n\n"
    "(b) This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d)\n\n"
    "(c) This again is is some third paragraph")

subst = "\\1"

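# Pass count and flags by keyword; passing them positionally is deprecated in newer Python versions.
print(re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE))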
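Note that re.sub returns a single string with the markers removed, not a list. If a list is what you are after, one possible follow-up is to split that string afterwards. The sketch below assumes the paragraphs are separated by one blank line, as in the sample; the names stripped and paragraphs and the shortened text are only for illustration.

```python
import re

regex = r"^(?:\([^)]+\))\s*(.*)"

# Shortened, hypothetical sample in the same shape as the question's text.
test_str = ("(a) first paragraph\n\n"
            "(b) refers to clause 945(d)\n\n"
            "(c) third paragraph")

# re.sub gives back one string in which each leading marker has been dropped.
stripped = re.sub(regex, r"\1", test_str, flags=re.MULTILINE)

# Assumption: paragraphs are separated by exactly one blank line.
paragraphs = stripped.split("\n\n")
print(paragraphs)
```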

Test with re.finditer
import re

regex = r"^(?:\([^)]+\))\s*(.*)"

test_str = ("(a) This is my first paragraph, which is some junk text\n\n"
    "(b) This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d)\n\n"
    "(c) This again is is some third paragraph")

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))

    # Groups are numbered from 1; group 0 is the whole match.
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))

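The expression is explained in the top-right panel of this demo, if you want to explore, simplify, or modify it. In this link, you can watch step by step how it matches against some sample inputs, if you like.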

RegEx Circuit

jex.im visualizes regular expressions.

Answer 1 (score: 0)

pattern = r'^\([a-z]\)'
re.split(pattern, entire_text, flags=re.MULTILINE)

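This will work, but the first element of the resulting list will be an empty string. It is a bit simpler than the other solution. We match the start of a line with ^, but for that to work in a string spanning multiple lines, the re.MULTILINE flag must be passed to re.split. If you want to drop that spurious first element, just slice the resulting list, like this: re.split(pattern, entire_text, flags=re.MULTILINE)[1:].

For more information on re.MULTILINE, see the docs.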
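As a rough sketch of how this could be applied to the text from the question: the pieces returned by re.split still carry the space after the marker and the blank lines between paragraphs, so this version strips each piece and drops empty ones instead of slicing. The names pieces and sub_sections are only illustrative.

```python
import re

entire_text = """(a) This is my first paragraph, which is some junk text

(b) This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d) somewhere within this text

(c) This again is is some third paragraph"""

# Anchoring with ^ means the "(d)" in "clause 945(d)" is not a split point,
# because it does not sit at the start of a line.
pattern = r'^\([a-z]\)'
pieces = re.split(pattern, entire_text, flags=re.MULTILINE)

# Strip surrounding whitespace and drop empty pieces (including the leading one).
sub_sections = [piece.strip() for piece in pieces if piece.strip()]
print(len(sub_sections))
```

Filtering on piece.strip() is a design choice here: it removes the empty first element without relying on its position, which also covers input that starts with blank lines.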