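I'm trying to find specific strings in a directory of PDF files. First I search each document with the regular expression below for anything that looks like one of the strings, then I compare the matches against a known list of existing strings: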
regex = "\\b(?:" + "|".join(symbols) + ")\\b"
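The code works if I scan sample text defined in the program itself. But when I loop over the PDFs, I get re.error: nothing to repeat at position 23457. So it seems one of the characters isn't escaped correctly, but I can't tell which one.
Here is my code: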
import PyPDF2
import os
import re

symbols = ['CA', 'VVI', 'MAVP', 'EB', 'GM', 'FCA', 'LMB', 'BHF', 'PELP', 'QQCM', 'BACC', 'A', 'XXCX']
source_dir = '/Users/test/Desktop/PDFs'

for dir, subdir, files in os.walk(source_dir):
    for file in files:
        if file.endswith('.pdf'):
            file = os.path.join(dir, file)
            pdfFileObj = open(file, 'rb')
            pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
            num_pages = pdfReader.numPages
            count = 0
            text = " "
            while count < num_pages:
                pageObj = pdfReader.getPage(count)
                count += 1
                text += pageObj.extractText()
            print(file + " mentions the following symbols:")
            regex = "\\b(?:" + "|".join(symbols) + ")\\b"
            matches = re.findall(regex, text)
            print(matches)
Traceback:
Traceback (most recent call last):
File "/Users/test/Desktop/Python/MSD/PDF_scrape_dir_regex.py", line 1280, in <module>
matches = re.findall(regex, text)
File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/re.py", line 223, in findall
return _compile(pattern, flags).findall(string)
File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/re.py", line 286, in _compile
p = sre_compile.compile(pattern, flags)
File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_parse.py", line 930, in parse
p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_parse.py", line 426, in _parse_sub
not nested and not items))
File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_parse.py", line 816, in _parse
p = _parse_sub(source, state, sub_verbose, nested + 1)
File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_parse.py", line 426, in _parse_sub
not nested and not items))
File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_parse.py", line 651, in _parse
source.tell() - here + len(this))
re.error: nothing to repeat at position 23457
In case it helps, here is the PDF file:
Answer 0 (score: 2)
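Some of your symbols contain characters that have special meaning in a regular expression. * and + indicate repetition of the preceding pattern. In one of your symbols such a character has no pattern in front of it (for example, it comes first in the symbol, right after the | separator), so there is nothing to repeat. If you look at position 23457 of the regex string, you should see the symbol that causes the problem.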
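To find the culprit without counting characters by hand, the exception itself can be inspected. The following is a minimal sketch, assuming a made-up symbol "+X" stands in for whatever symbol is actually failing; the helper name find_offending_fragment is hypothetical. It relies on the pos attribute that re.error carries.

```python
import re

# Hypothetical symbol list: "+X" is invented here to trigger the error.
symbols = ["CA", "VVI", "+X", "EB"]
pattern = r"\b(?:" + "|".join(symbols) + r")\b"


def find_offending_fragment(pattern, context=10):
    """Return the part of the pattern around the failure, or None if it compiles."""
    try:
        re.compile(pattern)
    except re.error as err:
        # err.pos is the index in the pattern where parsing failed.
        start = max(err.pos - context, 0)
        return pattern[start:err.pos + context]
    return None


print(find_offending_fragment(pattern))
```

With the real symbol list, the printed fragment should contain the symbol located around position 23457.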
Use re.escape() when building the regex so that all special characters are treated literally.
regex = r"\b(?:" + "|".join(map(re.escape, symbols)) + r")\b"
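Escaping matters even when the pattern happens to compile. As a small illustration, assuming an invented symbol "BRK.A" and an invented sample sentence, the unescaped dot matches any character and produces a false hit, while the escaped version matches only the literal symbol:

```python
import re

# "BRK.A" is a made-up symbol that contains a regex metacharacter (the dot).
symbols = ["CA", "BRK.A", "EB"]
text = "CA rose, BRK.A fell, BRKXA is unrelated, EB was flat"

unescaped = r"\b(?:" + "|".join(symbols) + r")\b"
escaped = r"\b(?:" + "|".join(map(re.escape, symbols)) + r")\b"

unescaped_matches = re.findall(unescaped, text)
escaped_matches = re.findall(escaped, text)

print(unescaped_matches)  # includes the false hit 'BRKXA'
print(escaped_matches)    # only the literal symbols
```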
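You should also assign this variable once before the loop rather than on every iteration, since the list of symbols doesn't change. And because parsing a long regex is expensive, you should call re.compile() so that it happens only once.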
import os
import re
import PyPDF2

# symbols is the list defined in the question.
# Build and compile the pattern once, before walking the directory.
regex = re.compile(r"\b(?:" + "|".join(map(re.escape, symbols)) + r")\b")
source_dir = '/Users/test/Desktop/PDFs'

for dir, subdir, files in os.walk(source_dir):
    for file in files:
        if file.endswith('.pdf'):
            file = os.path.join(dir, file)
            pdfFileObj = open(file, 'rb')
            pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
            num_pages = pdfReader.numPages
            count = 0
            text = " "
            # Concatenate the extracted text of every page.
            while count < num_pages:
                pageObj = pdfReader.getPage(count)
                count += 1
                text += pageObj.extractText()
            print(file + " mentions the following symbols:")
            matches = regex.findall(text)
            print(matches)
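The compile-once idea can be tried without any PDFs. This is a rough sketch in which plain strings stand in for the text extracted from each file; the helper names build_symbol_regex and scan_documents and the sample documents are hypothetical. Collapsing repeated matches into a sorted list is an extra step of this sketch, not part of the code above.

```python
import re


def build_symbol_regex(symbols):
    """Compile one alternation of literal symbols, bounded by word boundaries."""
    return re.compile(r"\b(?:" + "|".join(map(re.escape, symbols)) + r")\b")


def scan_documents(regex, documents):
    """Map each document name to the sorted distinct symbols found in its text."""
    return {name: sorted(set(regex.findall(text)))
            for name, text in documents.items()}


symbols = ["CA", "VVI", "MAVP", "EB"]
regex = build_symbol_regex(symbols)  # compiled once, reused for every document

# Invented stand-ins for text extracted from PDFs.
documents = {
    "a.pdf": "CA and VVI were mentioned; CA again.",
    "b.pdf": "Nothing relevant here, only CAB and EBX.",
}

results = scan_documents(regex, documents)
print(results)
```

Because of the \b anchors, CAB and EBX in the second sample are not reported as CA or EB.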