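I'm trying to iterate through PDFs to extract information from emails. My individual regex statements work when I try them on a single example, but when I put all the code into a for loop to iterate over multiple PDFs at once, I can't append to my aggregate df (currently I just end up with an empty df). I need to use try/except because not all emails have all fields (for example, some emails have no "Attachments" field). Below is the code I have written so far: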
import os
import pandas as pd
pd.options.display.max_rows=999
import numpy
from numpy import NaN
from tika import parser

root = r"my_dir"
agg_df = pd.DataFrame()
for directory, subdirectory, files in os.walk(root):
    for file in files:
        filepath = os.path.join(directory, file)
        print(file)
        raw = parser.from_file(filepath)
        img = raw['content']
        img = img.replace('\n', '')
        try:
            from_field = re.search(r'From:(.*?)Sent:', img).group(1)
        except:
            pass
        try:
            sent_field = re.search(r'Sent:(.*?)To:', img).group(1)
        except:
            pass
        try:
            to_field = re.search(r'To:(.*?)Cc:', img).group(1)
        except:
            pass
        try:
            cc_field = re.search(r'Cc:(.*?)Subject:', img).group(1)
        except:
            pass
        try:
            subject_field = re.search(r'Subject:(.*?)Attachments:', img).group(1)
        except:
            pass
        try:
            attachments_field = re.search(r'Attachments:(.*?)NOTICE', img).group(1)
        except:
            pass
        img_df = pd.DataFrame(columns=['From', 'Sent', 'To',
                                       'Cc', 'Subject', 'Attachments'])
        img_df['From'] = from_field
        img_df['Sent'] = sent_field
        img_df['To'] = to_field
        img_df['Cc'] = cc_field
        img_df['Subject'] = subject_field
        img_df['Attachments'] = attachments_field
        agg_df = agg_df.append(img_df)
Answer 0 (score: 0)
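Two things: collect the extracted values in plain Python lists and build the DataFrame once after the loop, rather than appending a per-file DataFrame on every iteration; and check whether re.search returned a match instead of wrapping each call in a bare try/except, so that a missing field is recorded as None rather than silently keeping the value from the previous file.
For example: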
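import os
import re
from collections import defaultdict

import pandas as pd
from tika import parser

root = r"my_dir"

# One list of values per column; the DataFrame is built once at the end.
data = defaultdict(list)

for directory, _, files in os.walk(root):
    for file in files:
        filepath = os.path.join(directory, file)
        print(file)
        raw = parser.from_file(filepath)
        img = raw['content']
        img = img.replace('\n', '')

        # re.search returns None when nothing matches, so test for that
        # instead of catching the error raised by calling .group() on None.
        from_match = re.search(r'From:(.*?)Sent:', img)
        if not from_match:
            sent_by = None
        else:
            sent_by = from_match.group(1)
        data["from"].append(sent_by)

        # The "To" field sits between "To:" and "Cc:" in the question's layout.
        to_match = re.search(r'To:(.*?)Cc:', img)
        if not to_match:
            sent_to = None
        else:
            sent_to = to_match.group(1)
        data["to"].append(sent_to)

        # All your other regexes

df = pd.DataFrame(data)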
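As for why the loop in the question ends up with an empty frame: assigning a scalar to a column of a DataFrame that has columns but no rows fills zero rows, so every per-file frame is empty before it is appended. The sketch below reproduces that with made-up values (the two field strings are hypothetical stand-ins, not real data) and contrasts it with building the frame from a list of row dicts.

```python
import pandas as pd

# Hypothetical stand-in values for fields extracted from one email.
from_field = "alice@example.com"
sent_field = "Monday, January 1, 2018"

# Pattern from the question: a frame with columns but no rows.
empty = pd.DataFrame(columns=["From", "Sent"])
empty["From"] = from_field  # the scalar is broadcast over an empty index
empty["Sent"] = sent_field
print(len(empty))  # number of rows after the assignments

# Building the frame from a list of row dicts gives one row per email.
rows = [{"From": from_field, "Sent": sent_field}]
filled = pd.DataFrame(rows, columns=["From", "Sent"])
print(len(filled))
```

The first frame still has no rows after both assignments, while the second has one row per dict in the list.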
Also, if you are processing a large number of files, you should consider using a compiled expression.
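A minimal sketch of what that could look like, assuming the same six patterns as in the question. The names FIELD_PATTERNS and extract_fields, the sample string, and the .strip() call are illustrative additions, not part of the original code. Python's re module already caches recently used patterns, so the speed gain may be modest; the main benefit here is keeping all patterns in one place.

```python
import re

# Patterns copied from the question, compiled once at import time
# rather than being passed as strings on every loop iteration.
FIELD_PATTERNS = {
    "From": re.compile(r"From:(.*?)Sent:"),
    "Sent": re.compile(r"Sent:(.*?)To:"),
    "To": re.compile(r"To:(.*?)Cc:"),
    "Cc": re.compile(r"Cc:(.*?)Subject:"),
    "Subject": re.compile(r"Subject:(.*?)Attachments:"),
    "Attachments": re.compile(r"Attachments:(.*?)NOTICE"),
}


def extract_fields(text):
    """Return a dict for one email; fields whose pattern does not match are None."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        # .strip() is an added convenience to drop surrounding spaces.
        row[name] = match.group(1).strip() if match else None
    return row


# Hypothetical email text with newlines already removed and no Attachments field.
sample = "From: Alice Sent: Monday To: Bob Cc: Carol Subject: Hello NOTICE: confidential"
print(extract_fields(sample))
```

Inside the file loop, each call to extract_fields(img) would yield one dict, and a list of those dicts can be passed to pd.DataFrame once at the end. Note that with the question's patterns, an email without an "Attachments:" label also leaves "Subject" as None, because the Subject pattern expects that label as its end marker.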