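Good morning, everyone.
I'm currently taking a Python class, and we haven't yet covered what I'm about to ask, so any help would be great. I have a Python script that extracts email addresses from a document, but it only lets me process one document at a time. I have about 500 files, most of which contain email addresses. I'd like to know whether this script can be changed to read all subfolders and documents and to skip any errors that come up. I understand that some file types may not be readable. Some of the common file types are .txt, .csv, .sql, and .xlsx.
Here is the script I found; it works well on one file at a time. As always, thanks for everyone's help.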
#!/usr/bin/env python
#
# Extracts email addresses from one or more plain text files.
#
# Notes:
# - Does not save to file (pipe the output to a file if you want it saved).
# - Does not check for duplicates (which can easily be done in the terminal).
#
from optparse import OptionParser
import os.path
import re
regex = re.compile((r"([a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`"
                    r"{|}~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|"
                    r"\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"))

def file_to_str(filename):
    """Returns the contents of filename as a string."""
    with open(filename) as f:
        return f.read().lower()  # Case is lowered to prevent regex mismatches.

def get_emails(s):
    """Returns an iterator of matched emails found in string s."""
    # Removing lines that start with '//' because the regular expression
    # mistakenly matches patterns like 'http://foo@bar.com' as '//foo@bar.com'.
    return (email[0] for email in re.findall(regex, s) if not email[0].startswith('//'))

if __name__ == '__main__':
    parser = OptionParser(usage="Usage: python %prog [FILE]...")
    # No options added yet. Add them here if you ever need them.
    options, args = parser.parse_args()
    if not args:
        parser.print_usage()
        exit(1)
    for arg in args:
        if os.path.isfile(arg):
            for email in get_emails(file_to_str(arg)):
                print(email)
        else:
            print('"{}" is not a file.'.format(arg))
            parser.print_usage()
Answer 0 (score: 1)
You can use os.walk to traverse all subdirectories:
import os

# This replaces the __main__ block of the script in the question.
if __name__ == '__main__':
    parser = OptionParser(usage="Usage: python %prog [DIRECTORIES]...")
    # No options added yet. Add them here if you ever need them.
    options, args = parser.parse_args()
    if not args:
        parser.print_usage()
        exit(1)
    for dir in args:
        for root, _, files in os.walk(dir):
            for file in files:
                if any(file.endswith(ext) for ext in ('.txt', '.csv', '.sql', '.xlsx')):
                    for email in get_emails(file_to_str(os.path.join(root, file))):
                        print(email)
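The question also asks to skip files that raise errors. One way to do that, as a rough sketch rather than a drop-in fix, is to wrap the read in a try/except and move on when a file cannot be opened or decoded. This assumes Python 3 and UTF-8 text; `read_text_or_none`, `emails_under`, and the simplified `EMAIL_RE` are hypothetical names made up for this example (the regex is a plain stand-in, not the one from the question's script).

```python
import os
import re
import sys

# Simplified stand-in pattern (assumption): matches plain "user@host.tld" only.
EMAIL_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+")


def read_text_or_none(path):
    """Return the lowercased contents of path, or None if it cannot be read as text."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read().lower()
    except (OSError, UnicodeDecodeError) as exc:
        # Report the problem on stderr and let the caller skip this file.
        print("Skipping {}: {}".format(path, exc), file=sys.stderr)
        return None


def emails_under(top):
    """Yield every address found in readable files anywhere below top."""
    for root, _, files in os.walk(top):
        for name in files:
            text = read_text_or_none(os.path.join(root, name))
            if text is not None:
                for email in EMAIL_RE.findall(text):
                    yield email
```

Something like `for email in emails_under("docs"): print(email)` would then walk a folder named `docs` (a placeholder path) and carry on past unreadable files. Binary formats such as .xlsx will usually fail to decode and be skipped here; reading them properly needs a spreadsheet library.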
Answer 1 (score: 1)
You can use os.walk like this:
# Example deny-list; replace these with the extensions you want to skip.
not_parseble_files = ['.txt', '.csv']
# root_folder is the top-level directory to scan.
for root, dirs, files in os.walk(root_folder):  # Recursively visits every subdirectory.
    for file in files:
        _, file_ext = os.path.splitext(file)  # Get the file's extension.
        file_path = os.path.join(root, file)
        if file_ext in not_parseble_files:  # Skip extensions on the deny-list 'not_parseble_files'.
            print("File %s is not parsable" % file_path)
            continue  # Move on to the next file.
        if os.path.isfile(file_path):
            for email in get_emails(file_to_str(file_path)):
                print(email)
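If it helps, the extension check can also be written as an allow-list that ignores case, so that NOTES.TXT is treated the same as notes.txt. The script's header notes that duplicates are not removed, and a set takes care of that. This is only a small sketch: `TEXT_EXTENSIONS`, `is_text_candidate`, and `unique_sorted` are hypothetical names, and the extension list is an assumption to adjust for your own files.

```python
import os

# Hypothetical allow-list (assumption): extensions that are usually plain text.
TEXT_EXTENSIONS = {".txt", ".csv", ".sql"}


def is_text_candidate(filename, allowed=TEXT_EXTENSIONS):
    """Return True if filename's extension, compared case-insensitively, is in allowed."""
    _, ext = os.path.splitext(filename)
    return ext.lower() in allowed


def unique_sorted(emails):
    """Collapse duplicate addresses and return them in sorted order."""
    return sorted(set(emails))
```

Inside the os.walk loop you could call `is_text_candidate(file)` before reading, collect the matches in a list, and print `unique_sorted(...)` once at the end.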