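I want to extract the subject of spam emails from a JSON file, but the subject can be anywhere in the file: inside "content", "header", or "body". I am using a regex, but I cannot extract the subject with the code below. Can someone point out what is wrong with the regex or the code?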
import re
import json
with open("test.json", 'r') as fp:
    json_decode = json.loads(fp.read())
p = re.compile('([\[\(] *)?.*(RE?S?|FWD?|re\[\d+\]?) *([-:;)\]][ :;\])-]*|$)|\]+ *$', re.IGNORECASE)
for line in json_decode:
    print(p.sub('', line).strip())
Output (incorrect): body
My test.json file looks like this:
{'attachment': [{'content_header': {'content-disposition': ['attachment; '
'filename="image006.jpg"'],
'content-id': ['<image006.jpg@01D35D21.756FEE10>']
'body': [{'content': ' \n'
' \n'
'From: eCard Delivery [mailto:ecards@789greeting.com] \n'
'Sent: Monday, November 13, 2017 9:14 AM\n'
'To: Zhang, Jerry (352A-Affiliate) '
'Subject: Warmest Wishes! You have a Happy Thanksgiving '
'ecard delivery!\n'
' \n'
' \tDear Jerry,\n'
'header': {'date': '2017-11-14T08:20:42-08:00',
'header': {'accept-language': ['en-US'],
'content-language': ['en-US'],
'content-type': ['multipart/mixed; '
'boundary="--boundary-LibPST-iamunique-1500317751_-_-"'],
'date': ['Tue, 14 Nov 2017 08:20:42 -0800']
'subject': 'FW: Warmest Wishes! You have a Happy Thanksgiving '
'ecard delivery!'}}
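^ The above is the correct format of the JSON file.
Answer 0 (score: 0)
OK. Given that your original JSON file probably does not contain newline characters, I would expect this approach to work, and it may even be more accurate: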
>>> string = '''{'attachment': [{'content_header': {'content-disposition': ['attachment; ''filename="image006.jpg"'],'content-id': ['<image006.jpg@01D35D21.756FEE10>'] 'body': [{'content': ' '' ''From: eCard Delivery [mailto:ecards@789greeting.com] ''Sent: Monday, November 13, 2017 9:14 AM''To: Zhang, Jerry (352A-Affiliate) ''Subject: Warmest Wishes! You have a Happy Thanksgiving ''ecard delivery!'' '' Dear Jerry,' 'header': {'date': '2017-11-14T08:20:42-08:00','header': {'accept-language': ['en-US'], 'content-language': ['en-US'], 'content-type': ['multipart/mixed; ''boundary="--boundary-LibPST-iamunique-1500317751_-_-"'], 'date': ['Tue, 14 Nov 2017 08:20:42 -0800'] 'subject': 'FW: Warmest Wishes! You have a Happy Thanksgiving ' 'ecard delivery!'}}'''
>>> subjects_test = re.findall('([\'|\"]*[\S]ubject[\S\s]+?[\'|\"]+)(?=\n|$|\s|\})', string)
>>> for subject in subjects_test:
...     print(subject)
#OUTPUT: #Kind of off I guess, but I don't know the full format of the file, so this is the safest bet
''Subject: Warmest Wishes! You have a Happy Thanksgiving ''ecard delivery!''
'subject': 'FW: Warmest Wishes! You have a Happy Thanksgiving '
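A note on the pattern above: inside a character class, `|` is a literal character rather than alternation, so `[\'|\"]` also matches `|`, and `[\S]ubject` matches any non-whitespace character followed by `ubject`. A tighter variant could look like the sketch below. `SUBJECT_RE` and the sample text are my own illustration, and it assumes each subject sits on a single line, so treat it as a starting point rather than a tested solution for your full file.

```python
import re

# Hypothetical tightened pattern: an optional quote, the word "subject"
# (any case), an optional closing quote, a colon, then the rest of the line.
SUBJECT_RE = re.compile(r"""['"]?subject['"]?\s*:\s*(.*)""", re.IGNORECASE)

sample = (
    "'Subject: Warmest Wishes! You have a Happy Thanksgiving '\n"
    "'subject': 'FW: Warmest Wishes! You have a Happy Thanksgiving '\n"
)

# findall returns only the captured group, i.e. the text after the colon
matches = SUBJECT_RE.findall(sample)

# Strip the quotes and spaces left over from the dump format
cleaned = [m.strip(" '\"") for m in matches]
for subject in cleaned:
    print(subject)
```

Using `re.IGNORECASE` covers both `Subject` and `subject` without a character class for the first letter.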
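Edit: using the string you provided above, in light of your comment below. Hopefully I understand what you are asking. I used both of the regex examples I provided.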
>>> string = '''{'attachment': [{'content_header': {'content-disposition': ['attachment; '
'filename="image006.jpg"'],
'content-id': ['<image006.jpg@01D35D21.756FEE10>']
'body': [{'content': ' \n'
' \n'
'From: eCard Delivery [mailto:ecards@789greeting.com] \n'
'Sent: Monday, November 13, 2017 9:14 AM\n'
'To: Zhang, Jerry (352A-Affiliate) '
'Subject: Warmest Wishes! You have a Happy Thanksgiving '
'ecard delivery!\n'
' \n'
' \tDear Jerry,\n'
'header': {'date': '2017-11-14T08:20:42-08:00',
'header': {'accept-language': ['en-US'],
'content-language': ['en-US'],
'content-type': ['multipart/mixed; '
'boundary="--boundary-LibPST-iamunique-1500317751_-_-"'],
'date': ['Tue, 14 Nov 2017 08:20:42 -0800']
'subject': 'FW: Warmest Wishes! You have a Happy Thanksgiving '
'ecard delivery!'}}'''
>>> subjects_test_1 = re.findall('([\'\"]*[S|s]ubject[:\s]*?(?:[\'|\"]*[\S\s]*?(?=[\'|\"])*))(?=\n|$)', string)
>>> for subject in subjects_test_1:
...     print(subject)
#OUTPUT:
'Subject: Warmest Wishes! You have a Happy Thanksgiving '
'subject': 'FW: Warmest Wishes! You have a Happy Thanksgiving '
########################################################
>>> subjects_test_2 = re.findall('([\'|\"]*[\S]ubject[\S\s]+?[\'|\"]*)(?=\n|$)', string)
>>> for subject in subjects_test_2:
...     print(subject)
#OUTPUT:
'Subject: Warmest Wishes! You have a Happy Thanksgiving '
'subject': 'FW: Warmest Wishes! You have a Happy Thanksgiving '
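The pattern in the question looks like it is meant to remove reply/forward markers such as `RE:` or `FW:` rather than to locate the subject. If that is still wanted, it can be done as a separate step once the subject text has been extracted. A minimal sketch, assuming the markers only appear at the start of the subject; `PREFIX_RE` and `strip_reply_prefix` are hypothetical names of mine, not something from your code:

```python
import re

# Hypothetical helper pattern: one or more leading "RE:", "FW:" or "FWD:"
# markers (any case), each optionally surrounded by whitespace.
PREFIX_RE = re.compile(r"^\s*(?:(?:re|fwd?)\s*:\s*)+", re.IGNORECASE)

def strip_reply_prefix(subject):
    """Return the subject without leading reply/forward markers."""
    return PREFIX_RE.sub("", subject).strip()

print(strip_reply_prefix("FW: Warmest Wishes! You have a Happy Thanksgiving "))
```

Keeping the extraction and the prefix cleanup in two small patterns tends to be easier to debug than one pattern that tries to do both.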
Or try this function.
On the line that calls the function, replace 'PATH_TO_YOUR_FILE' with... you know, the path to your file...
>>> def email_subject_parse(file_path):
...     import re
...     email_subjects = []
...     try:
...         with open(file_path) as file:
...             string = file.read()
...         email_subjects = re.findall('([\'\"]*[S|s]ubject[:\s]*?(?:[\'|\"]*[\S\s]*?(?=[\'|\"])*))(?=\n|$)', string)
...         # Or less complicated:
...         # email_subjects = re.findall('([\'|\"]*[\S]ubject[\S\s]+?[\'|\"]*)(?=\n|$)', string)
...     except OSError:
...         # open() raises OSError (e.g. FileNotFoundError) for a bad path
...         print('You have likely provided a bad file path')
...     # Always return a list so the caller can iterate safely
...     return email_subjects
>>> subjects = email_subject_parse('PATH_TO_YOUR_FILE')
>>> for subject in subjects:
...     print(subject)
#OUTPUT:
'Subject: Warmest Wishes! You have a Happy Thanksgiving '
'subject': 'FW: Warmest Wishes! You have a Happy Thanksgiving '
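As for the code in the question: iterating over the dict returned by `json.loads` yields only its top-level keys, so the loop never sees the message text. If your file really is valid JSON (the dump shown, with single quotes and adjacent string literals, looks more like a pretty-printed Python dict, so that is an assumption), another option is to walk the parsed structure instead of running a regex over the raw text. A rough sketch; `find_subjects`, `SUBJECT_LINE` and the small `doc` sample are made up for illustration:

```python
import json
import re

# Assumption: a "Subject:" marker inside a text block runs to the next newline.
SUBJECT_LINE = re.compile(r"\bSubject:[ \t]*(.+)", re.IGNORECASE)

def find_subjects(node):
    """Recursively collect subjects from a parsed JSON structure."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key.lower() == "subject" and isinstance(value, str):
                # A dedicated "subject" field: take its value as-is
                found.append(value.strip())
            else:
                found.extend(find_subjects(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_subjects(item))
    elif isinstance(node, str):
        # Free text (e.g. a quoted message body): look for "Subject:" markers
        found.extend(m.strip() for m in SUBJECT_LINE.findall(node))
    return found

# Small made-up document, shaped loosely like the sample in the question
doc = json.loads("""
{"body": [{"content": "From: someone\\nSubject: Warmest Wishes!\\n\\nDear Jerry,"}],
 "header": {"header": {"subject": "FW: Warmest Wishes!"}}}
""")
print(find_subjects(doc))
```

This way the nesting ("content", "header", "body") does not matter, and the regex only has to handle the free-text case.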