How do I fix "re.error: nothing to repeat at position 23457"?

Date: 2019-04-24 13:28:14

Tags: python regex

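I am trying to find specific strings in a directory of PDF files. I first use this regex to search each document for all candidate strings, then compare those against a list of known, existing strings: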

regex = "\\b(?:" + "|".join(symbols) + ")\\b"

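The code works if I scan sample text defined in the program itself. When I loop over the PDFs, however, I get re.error: nothing to repeat at position 23457. So it seems that one of the characters is not escaped properly, but I can't tell which one.

Here is my code: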

import PyPDF2
import os
import re

symbols = ['CA', 'VVI', 'MAVP', 'EB', 'GM', 'FCA', 'LMB', 'BHF', 'PELP', 'QQCM', 'BACC', 'A', 'XXCX']

source_dir = '/Users/test/Desktop/PDFs'
for dir, subdir, files in os.walk(source_dir):
    for file in files:
        if file.endswith('.pdf'):
            file = os.path.join(dir, file)
            pdfFileObj = open(file, 'rb')
            pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
            num_pages = pdfReader.numPages

            count = 0
            text = " "

            while count < num_pages:
                pageObj = pdfReader.getPage(count)
                count += 1
                text += pageObj.extractText()

            print(file + " mentions the following symbols:")

            regex = "\\b(?:" + "|".join(symbols) + ")\\b"
            matches = re.findall(regex, text)
            print(matches)

Traceback:

Traceback (most recent call last):
  File "/Users/test/Desktop/Python/MSD/PDF_scrape_dir_regex.py", line 1280, in <module>
    matches = re.findall(regex, text)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/re.py", line 223, in findall
    return _compile(pattern, flags).findall(string)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_parse.py", line 930, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_parse.py", line 816, in _parse
    p = _parse_sub(source, state, sub_verbose, nested + 1)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_parse.py", line 651, in _parse
    source.tell() - here + len(this))
re.error: nothing to repeat at position 23457

In case it helps, here are the PDF files:

1 Answer:

Answer 0 (score: 2)

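Some of your symbols contain characters that have a special meaning in regular expressions. * and + indicate repetition of the preceding pattern. One of your symbols has such a character with no pattern in front of it, so there is nothing to repeat. If you look at position 23457 of the regex, you should see the symbol that is causing the problem.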
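
Rather than counting characters by hand, one option is to catch the exception and slice the pattern around the position it reports. This is only a sketch: it assumes Python 3.5 or later, where re.error exposes a pos attribute, and both the helper name show_regex_error and the sample symbols list are made up for illustration.

```python
import re


def show_regex_error(pattern, context=10):
    """Return the part of `pattern` around a compile error, or None if it compiles."""
    try:
        re.compile(pattern)
    except re.error as err:
        # err.pos is the index in the pattern where parsing failed.
        # It can be None, in which case there is nothing to slice.
        if err.pos is None:
            return ""
        start = max(err.pos - context, 0)
        return pattern[start:err.pos + context]
    return None


# Hypothetical list: "*A" starts with a quantifier that has nothing before it.
symbols = ["CA", "VVI", "*A", "EB"]
regex = "\\b(?:" + "|".join(symbols) + ")\\b"
print(show_regex_error(regex))
```

With the real list, the printed slice should contain the symbol that sits at position 23457.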

Use re.escape() when building the regex so that all special characters are treated literally.

regex = r"\b(?:" + "|".join(map(re.escape, symbols)) + r")\b"
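
To see what the escaping changes, here is a small self-contained sketch. The symbol "*A" and the sample sentence are invented for illustration, since the question does not show which entry of the real list triggers the error.

```python
import re

# Hypothetical list: "*A" begins with a quantifier, which breaks the unescaped pattern.
symbols = ["CA", "*A", "EB"]

unescaped = r"\b(?:" + "|".join(symbols) + r")\b"
try:
    re.compile(unescaped)
    unescaped_ok = True
except re.error:
    unescaped_ok = False

# re.escape() puts a backslash in front of the "*", so it is matched literally.
escaped = r"\b(?:" + "|".join(map(re.escape, symbols)) + r")\b"
matches = re.findall(escaped, "CA rose while EB fell")

print(unescaped_ok)  # False
print(matches)       # ['CA', 'EB']
```

One thing to keep in mind: \b only matches next to a word character, so an escaped symbol that starts or ends with punctuation will match only when a word character sits on the other side of that boundary.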

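You should also assign this variable once before the loop rather than on every iteration, since the list of symbols doesn't change. And because parsing a long regex is expensive, you should call re.compile() so that the parsing is done only once.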

# Build and compile the pattern once, before walking the directory.
regex = re.compile(r"\b(?:" + "|".join(map(re.escape, symbols)) + r")\b")

source_dir = '/Users/test/Desktop/PDFs'
for dir, subdir, files in os.walk(source_dir):
    for file in files:
        if file.endswith('.pdf'):
            file = os.path.join(dir, file)
            pdfFileObj = open(file, 'rb')
            pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
            num_pages = pdfReader.numPages

            count = 0
            text = " "

            while count < num_pages:
                pageObj = pdfReader.getPage(count)
                count += 1
                text += pageObj.extractText()

            print(file + " mentions the following symbols:")

            matches = regex.findall(text)
            print(matches)