Question

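I'm trying to find specific strings in a directory of PDF files. The idea is to search each document for every candidate string and compare it against a known list of existing strings, using this regex: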

regex = "\\b(?:" + "|".join(symbols) + ")\\b"

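The code works if I scan sample text defined in the program itself. However, when I loop over the PDFs, I get re.error: nothing to repeat at position 23457. So it seems that one of the characters is not escaped properly, but I can't tell which one.

Here is my code: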

import PyPDF2
import os
import re

symbols = ['CA', 'VVI', 'MAVP', 'EB', 'GM', 'FCA', 'LMB', 'BHF', 'PELP', 'QQCM', 'BACC', 'A', 'XXCX']

source_dir = '/Users/test/Desktop/PDFs'
for dir, subdir, files in os.walk(source_dir):
    for file in files:
        if file.endswith('.pdf'):
            file = os.path.join(dir, file)
            pdfFileObj = open(file, 'rb')
            pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
            num_pages = pdfReader.numPages

            count = 0
            text = " "

            while count < num_pages:
                pageObj = pdfReader.getPage(count)
                count += 1
                text += pageObj.extractText()

            print(file + " mentions the following symbols:")

            regex = "\\b(?:" + "|".join(symbols) + ")\\b"
            matches = re.findall(regex, text)
            print(matches)

Traceback:

Traceback (most recent call last):
  File "/Users/test/Desktop/Python/MSD/PDF_scrape_dir_regex.py", line 1280, in <module>
    matches = re.findall(regex, text)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/re.py", line 223, in findall
    return _compile(pattern, flags).findall(string)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_parse.py", line 930, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_parse.py", line 816, in _parse
    p = _parse_sub(source, state, sub_verbose, nested + 1)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_parse.py", line 651, in _parse
    source.tell() - here + len(this))
re.error: nothing to repeat at position 23457

In case it helps, here is the PDF file:

Answer 1

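Some of your symbols contain characters that have special meaning in a regular expression. * and + indicate repetition of the preceding pattern. One of your symbols has such a character with no pattern in front of it, so there is nothing to repeat. If you look at position 23457 of the regex, you should see the symbol that causes the problem.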
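To see the failure in isolation and find the offending entry, something like the following can help. It is a minimal sketch: the symbol list is made up (the entry '*XYZ' is a hypothetical stand-in for whatever is in the real list), and locate_error and find_bad_symbols are illustrative helper names, not part of any library. It relies on the pos attribute that re.error carries.

```python
import re

# Hypothetical list: '*XYZ' stands in for a symbol that starts with a
# regex metacharacter. The real list from the question is not shown in full.
symbols = ['CA', 'VVI', '*XYZ', 'A']
pattern = "\\b(?:" + "|".join(symbols) + ")\\b"


def locate_error(pattern, width=10):
    """Return (position, nearby pattern text) for a compile error, else None."""
    try:
        re.compile(pattern)
    except re.error as err:
        pos = err.pos if err.pos is not None else 0
        start = max(pos - width, 0)
        return pos, pattern[start:pos + width]
    return None


def find_bad_symbols(symbols):
    """Return the symbols that are not valid regexes on their own."""
    bad = []
    for symbol in symbols:
        try:
            re.compile(symbol)
        except re.error:
            bad.append(symbol)
    return bad


print(locate_error(pattern))
print(find_bad_symbols(symbols))
```

The second helper only catches symbols that fail to compile by themselves. A symbol such as one containing '.' compiles fine but matches more than intended, which is another reason to escape every entry.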

Use re.escape() when building the regex so that all special characters are treated literally.

regex = r"\b(?:" + "|".join(map(re.escape, symbols)) + r")\b"
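A quick check of what escaping changes, again with a made-up list ('BRK.B' and '*XYZ' are hypothetical entries chosen only because they contain metacharacters) and made-up sample text:

```python
import re

# Hypothetical symbols containing regex metacharacters ('.' and '*').
symbols = ['CA', 'BRK.B', '*XYZ']

escaped = r"\b(?:" + "|".join(map(re.escape, symbols)) + r")\b"
regex = re.compile(escaped)  # compiles: '*' and '.' are now literals

text = "CA and BRK.B rose, BRKXB did not"
matches = regex.findall(text)
print(matches)
```

Without escaping, the '.' in 'BRK.B' would also match 'BRKXB', and '*XYZ' would keep the pattern from compiling at all.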

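You should also assign this variable once before the loop rather than on every iteration, since the list of symbols doesn't change. And because parsing a long regex is expensive, you should call re.compile() so that it happens only once.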

import os
import re

import PyPDF2

# symbols is the list from the question. Escape each entry and compile the
# pattern once, before the loop.
regex = re.compile(r"\b(?:" + "|".join(map(re.escape, symbols)) + r")\b")

source_dir = '/Users/test/Desktop/PDFs'
for dir, subdir, files in os.walk(source_dir):
    for file in files:
        if file.endswith('.pdf'):
            file = os.path.join(dir, file)
            text = " "

            # Close the file as soon as all pages have been read.
            with open(file, 'rb') as pdfFileObj:
                pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
                num_pages = pdfReader.numPages

                count = 0
                while count < num_pages:
                    pageObj = pdfReader.getPage(count)
                    count += 1
                    text += pageObj.extractText()

            print(file + " mentions the following symbols:")

            matches = regex.findall(text)
            print(matches)
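To try the compile-once approach without any PDFs on disk, the matching step can be pulled into a function that takes already-extracted text. This is only a sketch: find_mentions, the sample list and the sample texts are all hypothetical stand-ins for what extractText() would return.

```python
import re

symbols = ['CA', 'VVI', 'BRK.B']  # hypothetical sample list

# Compiled once at import time and reused for every document.
SYMBOL_RE = re.compile(r"\b(?:" + "|".join(map(re.escape, symbols)) + r")\b")


def find_mentions(texts):
    """Map each document name to the list of symbols found in its text."""
    return {name: SYMBOL_RE.findall(text) for name, text in texts.items()}


# Stand-in for text extracted from two PDF files.
texts = {
    'a.pdf': "VVI and CA were up; VVI led.",
    'b.pdf': "Nothing relevant here.",
}
for name, matches in find_mentions(texts).items():
    print(name + " mentions the following symbols:", matches)
```

Keeping extraction and matching separate also makes it easier to test the regex against known text before running it over a whole directory.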
