Question: How to split PDF text by numbering

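My question is not about PDF extraction itself. Suppose this is the text extracted from a PDF:

(a) This is my first paragraph, which is some junk text

(b) This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d) somewhere within this text

(c) This again is is some third paragraph

Now I am trying to create a list of 3 values, one for each paragraph.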

import re
entire_text = """(a) This is my first paragraph, which is some junk text

(b) This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d) somewhere within this text

(c) This again is is some third paragraph"""
PDF_SUB_SECTIONS = ["(a) ", "(b) ", "(c) ", "(d) ", "(e) ", "(f) ", "(g) "]
regexPattern = '|'.join(map(re.escape,PDF_SUB_SECTIONS))
glSubSections = re.split(regexPattern, entire_text)

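What I expect: ["This is my first paragraph, which is some junk text", "This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d) somewhere within this text", "This again is is some third paragraph"]

What I get: ["This is my first paragraph, which is some junk text", "This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945", "somewhere within this text", "This again is is some third paragraph"]

More info: 1) In a reference such as clause 945(d), "945" (or whatever text precedes it) is followed directly by "(d", with no space in between. 2) I am using PyPDF2 to extract the text above.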

Answer 1

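There are several ways to do this with regular expressions, but they tend to get complicated quickly, and a regex may not be the best approach. For example, you could use an expression similar to: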

^(?:\([^)]+\))\s*(.*)

Using `re.findall`

Test

import re

regex = r"^(?:\([^)]+\))\s*(.*)"

test_str = ("(a) This is my first paragraph, which is some junk text\n\n"
    "(b) This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d)\n\n"
    "(c) This again is is some third paragraph")

print(re.findall(regex, test_str, re.MULTILINE))

Output

['This is my first paragraph, which is some junk text', 'This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d)', 'This again is is some third paragraph']

Using `re.sub`

Test

import re

regex = r"^(?:\([^)]+\))\s*(.*)"

test_str = ("(a) This is my first paragraph, which is some junk text\n\n"
    "(b) This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d)\n\n"
    "(c) This again is is some third paragraph")

subst = "\\1"

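# Pass count and flags by keyword; passing them positionally is deprecated in recent Python versions.
print(re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE))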
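Note that `re.sub` returns one string with the markers removed, not a list. The following is a minimal sketch of one way to get the three-item list the question asks for; it assumes paragraphs are separated by blank lines, as in the sample text, and the names `stripped` and `paragraphs` are illustrative only.

```python
import re

regex = r"^(?:\([^)]+\))\s*(.*)"

test_str = ("(a) This is my first paragraph, which is some junk text\n\n"
    "(b) This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d)\n\n"
    "(c) This again is is some third paragraph")

# Remove the leading "(x)" marker from every line that starts with one.
stripped = re.sub(regex, r"\1", test_str, flags=re.MULTILINE)

# Assumption: paragraphs are separated by one or more blank lines.
paragraphs = [p for p in re.split(r"\n\s*\n", stripped) if p]

print(paragraphs)
```

If the extracted text does not reliably contain blank lines between paragraphs, this second step would need a different separator.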

Using `re.finditer`

Test

import re

regex = r"^(?:\([^)]+\))\s*(.*)"

test_str = ("(a) This is my first paragraph, which is some junk text\n\n"
    "(b) This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d)\n\n"
    "(c) This again is is some third paragraph")

matches = re.finditer(regex, test_str, re.MULTILINE)

for match_num, match in enumerate(matches, start=1):
    # Report the span and text of the whole match.
    print(f"Match {match_num} was found at {match.start()}-{match.end()}: {match.group()}")

    # Report each capturing group (groups are numbered from 1).
    for group_num in range(1, len(match.groups()) + 1):
        print(f"Group {group_num} found at {match.start(group_num)}-{match.end(group_num)}: {match.group(group_num)}")

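The expression is explained in the top right panel of this demo, if you wish to explore, simplify, or modify it. In this link, you can watch how it matches against some sample inputs step by step, if you like.

RegEx Circuit

jex.im visualizes regular expressions.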
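Since the linked explanation is not reproduced here, the same expression can be annotated inline with `re.VERBOSE`. This is only a sketch: the name `paragraph_re` and the shortened `sample` string are made up for illustration.

```python
import re

paragraph_re = re.compile(
    r"""
    ^               # start of a line (requires re.MULTILINE)
    (?:             # non-capturing group for the marker
        \( [^)]+ \) #   "(", one or more characters that are not ")", then ")"
    )
    \s*             # optional whitespace after the marker
    (.*)            # group 1: the rest of the line
    """,
    re.MULTILINE | re.VERBOSE,
)

# Hypothetical, shortened sample input.
sample = "(a) first\n\n(b) see clause 945(d) here\n\n(c) third"

print(paragraph_re.findall(sample))
# ['first', 'see clause 945(d) here', 'third']
```

Because the marker must appear at the start of a line, the "(d)" inside "945(d)" is left alone. Note that `(.*)` stops at the end of the line, so a paragraph that wraps over several lines would be cut short.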

Answer 2

pattern = r'^\([a-z]\)'
re.split(pattern, entire_text, flags=re.MULTILINE)

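This works, but the first element of the resulting list will be an empty string. It is a bit simpler than the other solution. We match the start of a line with ^, but for that to work on a string that spans multiple lines, the re.MULTILINE flag must be passed to re.split. If you want to drop the spurious first element, just slice the resulting list, like this: re.split(pattern, entire_text, flags=re.MULTILINE)[1:].

For more information on re.MULTILINE, see the docs.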
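Each piece returned by this split still carries the space that followed its marker and, for all but the last, the trailing blank line. A minimal sketch that cleans this up, reusing `entire_text` from the question (the names `parts` and `sub_sections` are illustrative):

```python
import re

entire_text = """(a) This is my first paragraph, which is some junk text

(b) This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d) somewhere within this text

(c) This again is is some third paragraph"""

pattern = r'^\([a-z]\)'

# Split only on markers that sit at the start of a line.
parts = re.split(pattern, entire_text, flags=re.MULTILINE)

# Strip surrounding whitespace and drop the empty first element.
sub_sections = [part.strip() for part in parts if part.strip()]

for section in sub_sections:
    print(repr(section))
```

Filtering on `part.strip()` removes the empty leading element without relying on the `[1:]` slice, which may be a little more robust if the text does not begin with a marker.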
