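I'm trying to find every phone number in every CSV file on my entire computer, regardless of formatting differences (within reason). I'd like to make this as fast as possible, and I'd also welcome general help with the code (I've just been doing this my own way). I switched from glob to formic, and I'm not even sure I'm using it correctly (especially with Windows' backslash-based file system). For context, I have thousands of files, and some of them contain millions of phone numbers.
Here is what I'm trying: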
import csv
import re
from formic import FileSet

with open('MegaList.csv', 'wb') as out:
    seen = set()
    regex = re.compile(r'(\+?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4})')
    csv_files = FileSet(directory='C:\\', include='**\\*.csv')
    out_writer = csv.writer(out)
    out_writer.writerow([])
    for filename in csv_files:
        with open(filename, 'rbU') as ifile:
            read = csv.reader(ifile)
            try:  # some files have row return NULL byte, so skip errors
                for row in read:
                    for column in row:
                        s1 = column.strip()
                        if 9 < len(column) < 15:  # does this make it faster or slower?
                            match = regex.search(s1)
                            if match:
                                canonical_phone = re.sub(r'\D', '', match.group(0))
                                if canonical_phone not in seen:
                                    seen.add(canonical_phone)
            except:
                pass
    for val in seen:
        out_writer.writerow([val])
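
For comparison, here is a minimal sketch of the same pipeline using only the standard library. It is not a tested replacement for the code above, and it rests on several assumptions: Python 3 rather than the Python 2 style of my attempt, os.walk in place of formic, North American style 10-digit numbers only, and scanning each raw line with the regex instead of parsing it with csv.reader (so a number split across two columns would be missed). The names PHONE_RE, extract_phones, iter_csv_paths, collect_phones and write_phones are hypothetical helpers made up for this illustration.

```python
import csv
import os
import re

# Hypothetical pattern (an assumption, not a complete phone grammar):
# optional "+1" or "1" prefix, optional parentheses around the area code,
# and space, dot, or hyphen separators. The lookarounds reject matches
# that sit inside a longer run of digits.
PHONE_RE = re.compile(
    r'(?<!\d)(?:\+?1[ .-]?)?\(?([2-9]\d{2})\)?[ .-]?(\d{3})[ .-]?(\d{4})(?!\d)'
)


def extract_phones(text):
    """Return the 10-digit canonical form of every match found in text."""
    return [''.join(m.groups()) for m in PHONE_RE.finditer(text)]


def iter_csv_paths(root):
    """Yield the path of every .csv file under root (case-insensitive)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith('.csv'):
                yield os.path.join(dirpath, name)


def collect_phones(root):
    """Scan each CSV file line by line and collect the unique numbers."""
    seen = set()
    for path in iter_csv_paths(root):
        try:
            # errors='replace' keeps undecodable bytes from aborting the scan
            with open(path, 'r', encoding='utf-8', errors='replace') as handle:
                for line in handle:
                    seen.update(extract_phones(line))
        except OSError:
            # Unreadable file (permissions, locks): skip it and carry on
            continue
    return seen


def write_phones(phones, out_path):
    """Write one canonical number per row, sorted for a stable output."""
    with open(out_path, 'w', newline='') as out:
        writer = csv.writer(out)
        for phone in sorted(phones):
            writer.writerow([phone])
```

Under those assumptions, usage would be something like write_phones(collect_phones('C:\\'), 'MegaList.csv'). Because the regex returns the digit groups directly, there is no separate re.sub step, and because set.update ignores duplicates, there is no explicit "not in seen" check. Whether this is actually faster than the csv.reader version on my data is something I would still have to measure.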