Cleaning an email chain for text analysis in Python

Time: 2018-08-03 15:39:28

Tags: python text

I have some text:

text = """From: 'Mark Twain' <mark.twain@gmail.com>
To: 'Edgar Allen Poe' <eap@gmail.com>
Subject: RE:Hello!

Ed,

I just read the Tell Tale Heart. You\'ve got problems man.

Sincerely,
Marky Mark

From: 'Edgar Allen Poe' <eap@gmail.com>
To: 'Mark Twain' <mark.twain@gmail.com>
Subject: RE: Hello!

Mark,

The world is crushing my soul, and so are you.

Regards,
Edgar"""

which looks like this:

"From: 'Mark Twain' <mark.twain@gmail.com>\nTo: 'Edgar Allen Poe' <eap@gmail.com>\nSubject: RE:Hello!\n\nEd,\n\nI just read the Tell Tale Heart. You've got problems man.\n\nSincerely,\nMarky Mark\n\nFrom: 'Edgar Allen Poe' <eap@gmail.com>\nTo: 'Mark Twain' <mark.twain@gmail.com>\nSubject: RE: Hello!\n\nMark,\n\nThe world is crushing my soul, and so are you.\n\nRegards,\nEdgar"

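I'm trying to parse the messages out of it. Ultimately, I'd like a list or dictionary holding the "From" and "To" fields, followed by the message body to run the analysis on.

I tried to parse it by lowercasing everything and then doing a string split.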

text = text.lower()
text = text.translate(string.punctuation)
text_list = text.split('+')
text_list = [x for x in text_list if len(x) != 0]

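Is there a better way?

3 answers:

Answer 0 (score: 3)

That's not how str.translate works. Your text.translate(string.punctuation) uses the punctuation characters as the translation table, so it maps '\n' (code point 10) to the character at index 10 of string.punctuation, which is '+'. The usual way to use str.translate is to first create a translation table with str.maketrans, which lets you specify the characters to map from, the corresponding characters to map to, and (optionally) the characters to delete. If you only want to use it for deletion, you can create the table with dict.fromkeys, e.g.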

table = dict.fromkeys([ord(c) for c in string.punctuation])

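This builds a dict that maps the code point of each character in string.punctuation to None, and str.translate deletes any character that is mapped to None.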
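To make the difference concrete, here is a minimal sketch that contrasts the two calls on a short made-up string (the sample string and the variable names are only illustrative). It relies on the fact that str.translate looks each character up by code point and leaves it unchanged when the lookup fails.

```python
import string

sample = "Ed,\nHello!"

# Passing the punctuation string itself: it is indexed by code point,
# so '\n' (code point 10) becomes string.punctuation[10], i.e. '+'.
# Every other character here has a code point past the end of the
# string, so the lookup fails and the character is kept.
broken = sample.translate(string.punctuation)
print(repr(broken))   # 'Ed,+Hello!'

# Passing a real table: each punctuation code point maps to None,
# which tells str.translate to delete that character.
delete_table = dict.fromkeys(map(ord, string.punctuation))
cleaned = sample.translate(delete_table)
print(repr(cleaned))  # 'Ed\nHello'
```

This is also why splitting on '+' appeared to work in the question: the newlines had silently been turned into plus signs.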

Here's a repaired version of your code that uses str.translate to perform the case conversion and the punctuation removal in a single step.

import string

# Map upper case to lower case & remove punctuation
table = str.maketrans(string.ascii_uppercase,
    string.ascii_lowercase, string.punctuation)

text = text.translate(table)
text_list = text.split('\n')
for row in text_list:
    print(repr(row))

Output

'from mark twain marktwaingmailcom'
'to edgar allen poe eapgmailcom'
'subject rehello'
''
'ed'
''
'i just read the tell tale heart youve got problems man'
''
'sincerely'
'marky mark'
''
'from edgar allen poe eapgmailcom'
'to mark twain marktwaingmailcom'
'subject re hello'
''
'mark'
''
'the world is crushing my soul and so are you'
''
'regards'
'edgar'

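However, simply deleting all the punctuation is a bit crude, because it joins some words that you probably don't want joined (for example, 'marktwaingmailcom' and 'rehello' above). Instead, we can translate each punctuation character to a space and then split on whitespace: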

# Map all punctuation to space
table = dict.fromkeys([ord(c) for c in string.punctuation], ' ')
text = text.translate(table).lower()
text_list = text.split()
print(text_list)

Output

['from', 'mark', 'twain', 'mark', 'twain', 'gmail', 'com', 'to', 'edgar', 'allen', 'poe', 'eap', 'gmail', 'com', 'subject', 're', 'hello', 'ed', 'i', 'just', 'read', 'the', 'tell', 'tale', 'heart', 'you', 've', 'got', 'problems', 'man', 'sincerely', 'marky', 'mark', 'from', 'edgar', 'allen', 'poe', 'eap', 'gmail', 'com', 'to', 'mark', 'twain', 'mark', 'twain', 'gmail', 'com', 'subject', 're', 'hello', 'mark', 'the', 'world', 'is', 'crushing', 'my', 'soul', 'and', 'so', 'are', 'you', 'regards', 'edgar']
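The same punctuation-to-space table can also be built with str.maketrans by passing two strings of equal length. The sketch below wraps it in a small helper; the name tokenize and the sample string are assumptions made for illustration, not part of the answer above.

```python
import string

# Two equal-length strings: each punctuation character maps to a space.
space_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def tokenize(s):
    """Replace punctuation with spaces, lowercase, and split on whitespace."""
    return s.translate(space_table).lower().split()

tokens = tokenize("Subject: RE:Hello!\n\nYou've got problems man.")
print(tokens)
```

Note that apostrophes are treated like any other punctuation, so "You've" still becomes two tokens.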

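Answer 1 (score: 2)

If all you want is to parse a string containing emails in the standard format, use the email.parser module; it is part of the standard library.

You still need to separate the individual emails within the larger text, but the From: ... headers can help you with that, using a regular expression: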

import re
from email import parser, policy

# Split on the newline that closes a blank line directly before a "From:" header
email_start = re.compile(r'(?<=\n)\n(?=From:\s+)')

# Use a distinct name so the email.parser module is not shadowed
email_parser = parser.Parser(policy=policy.default)

for email_text in email_start.split(text):
    message = email_parser.parsestr(email_text)
    to, from_ = message['to'], message['from']
    body = message.get_payload()
    # do something with the email details

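The regular expression matches any newline that is directly preceded by another newline (so there is an empty line) and is followed by the text From: and at least one whitespace character (so the next line looks like an email From: header).

Even when the tool is used correctly, removing or replacing punctuation is not a very effective way to get at the same pieces of information.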
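As a rough end-to-end sketch of the approach above, the snippet below splits the sample chain from the question and collects one dictionary per message. The helper name split_chain and the dictionary keys are assumptions chosen for illustration; the parsing itself uses only the standard-library email.parser.Parser and email.policy.default.

```python
import re
from email import policy
from email.parser import Parser

text = (
    "From: 'Mark Twain' <mark.twain@gmail.com>\n"
    "To: 'Edgar Allen Poe' <eap@gmail.com>\n"
    "Subject: RE:Hello!\n"
    "\n"
    "Ed,\n"
    "\n"
    "I just read the Tell Tale Heart. You've got problems man.\n"
    "\n"
    "Sincerely,\n"
    "Marky Mark\n"
    "\n"
    "From: 'Edgar Allen Poe' <eap@gmail.com>\n"
    "To: 'Mark Twain' <mark.twain@gmail.com>\n"
    "Subject: RE: Hello!\n"
    "\n"
    "Mark,\n"
    "\n"
    "The world is crushing my soul, and so are you.\n"
    "\n"
    "Regards,\n"
    "Edgar"
)

# A newline preceded by a newline and followed by a "From:" header
email_start = re.compile(r'(?<=\n)\n(?=From:\s+)')
email_parser = Parser(policy=policy.default)

def split_chain(chain):
    """Return one dict per message with 'from', 'to' and 'body' keys."""
    records = []
    for email_text in email_start.split(chain):
        message = email_parser.parsestr(email_text)
        records.append({
            'from': str(message['from']),
            'to': str(message['to']),
            'body': message.get_payload().strip(),
        })
    return records

records = split_chain(text)
for record in records:
    print(record['from'], '->', record['to'])
```

Each body can then be handed to whatever text-analysis step comes next. A body line that itself starts with From: after a blank line would be treated as the start of a new message, so real mailboxes may need a stricter separator.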


Answer 2 (score: 1)

You can use re to split the messages (explanation of this regexp on an external site). The result is a list of dictionaries with the keys 'from', 'to', 'subject' and 'message':

text = """From: 'Mark Twain' <mark.twain@gmail.com>
To: 'Edgar Allen Poe' <eap@gmail.com>
Subject: RE:Hello!

Ed,

I just read the Tell Tale Heart. You\'ve got problems man.

Sincerely,
Marky Mark

From: 'Edgar Allen Poe' <eap@gmail.com>
To: 'Mark Twain' <mark.twain@gmail.com>
Subject: RE: Hello!

Mark,

The world is crushing my soul, and so are you.

Regards,
Edgar"""

import re
from pprint import pprint

groups = re.findall(r'^From:(.*?)To:(.*?)Subject:(.*?)$(.*?)(?=^From:|\Z)', text, flags=re.DOTALL|re.M)
emails = []
for g in groups:
    d = {}
    d['from'] = g[0].strip()
    d['to'] = g[1].strip()
    d['subject'] = g[2].strip()
    d['message'] = g[3].strip()
    emails.append(d)

pprint(emails)

Prints:

[{'from': "'Mark Twain' <mark.twain@gmail.com>",
  'message': 'Ed,\n'
             '\n'
             "I just read the Tell Tale Heart. You've got problems man.\n"
             '\n'
             'Sincerely,\n'
             'Marky Mark',
  'subject': 'RE:Hello!',
  'to': "'Edgar Allen Poe' <eap@gmail.com>"},
 {'from': "'Edgar Allen Poe' <eap@gmail.com>",
  'message': 'Mark,\n'
             '\n'
             'The world is crushing my soul, and so are you.\n'
             '\n'
             'Regards,\n'
             'Edgar',
  'subject': 'RE: Hello!',
  'to': "'Mark Twain' <mark.twain@gmail.com>"}]
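A possible variant of the same idea uses named groups, so the dictionary can be built straight from match.groupdict() instead of by index. This is only a sketch: the group names sender and recipient, the helper parse_chain, and the shortened sample text are assumptions for illustration, and the pattern is otherwise the one shown above.

```python
import re

# Same pattern as above, with each capture given a name.
EMAIL_RE = re.compile(
    r'^From:(?P<sender>.*?)To:(?P<recipient>.*?)Subject:(?P<subject>.*?)$'
    r'(?P<message>.*?)(?=^From:|\Z)',
    flags=re.DOTALL | re.MULTILINE,
)

def parse_chain(chain):
    """Return one dict per message, with surrounding whitespace stripped."""
    return [
        {key: value.strip() for key, value in match.groupdict().items()}
        for match in EMAIL_RE.finditer(chain)
    ]

sample = (
    "From: 'Mark Twain' <mark.twain@gmail.com>\n"
    "To: 'Edgar Allen Poe' <eap@gmail.com>\n"
    "Subject: RE:Hello!\n"
    "\n"
    "Ed,\n"
    "\n"
    "You've got problems man.\n"
    "\n"
    "From: 'Edgar Allen Poe' <eap@gmail.com>\n"
    "To: 'Mark Twain' <mark.twain@gmail.com>\n"
    "Subject: RE: Hello!\n"
    "\n"
    "Mark,\n"
    "\n"
    "The world is crushing my soul, and so are you.\n"
)

emails = parse_chain(sample)
for email in emails:
    print(email['sender'], '|', email['subject'])
```

As with the original pattern, the words To: or Subject: appearing earlier than expected, or a body line that begins with From:, would shift the captured fields, so this is best suited to text with a predictable layout.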