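Apart from a little experience with Tkinter (a few GUI experiments), I'm new to Python.
I read the .mbox file and copy the text/plain part into a string. That text contains a registration form. For example, Stefan, who lives on Maple Street in London and works for the company "MultiVendor XXVideos", has registered by email for a subscription.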
Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
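I want to take this data and append it as a row to a .csv file with the columns "Name", "Address", "Company", ...
Right now I'm trying to cut out slices. For debugging I use "print" (IDE = KATE / KDE + terminal ... :-D). The problem is that the data after a keyword spans several lines, but I only get the first line.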
import mailbox
import csv
import email
from time import sleep
import string

fieldnames = ["ID","Subject","Name", "Adress", "Company"]
searchKeys = [ 'Name_OF_Person','Adress_HOME','Company_NAME']
mbox_file = "REG.mbox"
export_file_name = "test.csv"

if __name__ == "__main__":
    with open(export_file_name,"w") as csvfile:
        writer = csv.DictWriter(csvfile, dialect='excel',fieldnames=fieldnames)
        writer.writeheader()
        for message in mailbox.mbox(mbox_file):
            if message.is_multipart():
                content = '\n'.join(part.get_payload() for part in message.get_payload())
                content = content.split('<')[0] # only want text/plain.. Ill split #right before HTML starts
                #print content
            else:
                content = message.get_payload()
            idea = message['message-id']
            sub = message['subject']
            fr = message['from']
            date = message['date']
            writer.writerow ('ID':idea,......) # CSV writing will work fine
            for line in content.splitlines():
                line = line.strip()
                for pose in searchKeys:
                    if pose in line:
                        tmp = line.split(pose)
                        pmt = tmp[1].split(":")[1]
                        if next in line !=:
                            print pose +"\t"+pmt
                            sleep(1)
    csvfile.closed
Output:
OFFICIAL_POSTAL_ADDRESS =20
Lines are missing here. From the file:
OFFICIAL_POSTAL_ADDRESS: =20
London, testarossa street 41
EDIT2:
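@Yaniv Thank you. I'm still trying to understand each step, but I wanted to comment already. I like the idea of using a list / matrix / vector "key_value_pairs". The emails contain about 20 keywords. Also, my values are sometimes broken across lines by a "=". What I have in mind is: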
Search text for Keyword A,
if true:
search text from Keyword A until keyword B
if true:
copy text after A until B
Name_OF_=
Person: Stefan
Adress_
=HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
Maybe the HTML from EMAIL.mbox is easier to handle?
<tr><td bgcolor=3D"#eeeeee"><font face=3D"Verdana" size=3D"1">
<strong>NAM=
E_REGISTERING_PERSON</strong></font></td><td bgcolor=3D"#eeeeee"><font
fac=e=3D"Verdana" size=3D"1">Stefan </font></td></tr>
But the "=" is still there. Should I replace ["=", "="] with ""?
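Answer 0 (score: 1)
I would go with a "routine" loop that parses the input line by line and maintains current_key and current_value variables, because the value for a given key in your data can be "annoying" and spread over several lines.
I've demonstrated this parsing approach in the code below, making some assumptions about your problem. For example, if an input line starts with a space, I assume it must belong to such an "annoying" value (spread over several lines). These lines are joined into a single value using a configurable string (the parameter join_lines_using_this). Another assumption is that you probably want to strip whitespace from keys and values.
Feel free to adapt the code to your own assumptions about the input, and raise exceptions whenever they don't hold!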
# Note the usage of .strip() in some places, to strip away whitespaces. I assumed you might want that.
def parse_funky_text(text, join_lines_using_this=" "):
    key_value_pairs = []
    current_key, current_value = None, ""
    for line in text.splitlines():
        line_split = line.split(':')
        if line.startswith(" ") or len(line_split) == 1:
            # Continuation line: it belongs to the key seen most recently
            if current_key is None:
                raise ValueError("Failed to parse this line, not sure which key it belongs to: %s" % line)
            current_value += join_lines_using_this + line.strip()
        else:
            if current_key is not None:
                key_value_pairs.append((current_key, current_value))
                current_key, current_value = None, ""
            current_key = line_split[0].strip()
            # We've just found a new key, so here you might want to perform additional checks,
            # e.g. if current_key not in searchKeys: raise ValueError("Encountered a weird key?! %s in line: %s" % (current_key, line))
            current_value = ':'.join(line_split[1:]).strip()
    # Don't forget the last parsed key, value
    if current_key is not None:
        key_value_pairs.append((current_key, current_value))
    return key_value_pairs
Usage example:
text = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos"""
parse_funky_text(text)
This will output:
[('Name_OF_Person', 'Stefan'), ('Adress_HOME', 'London, Maple Street 45'), ('Company_NAME', 'MultiVendor XXVideos')]
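One thing the parser above does not deal with is the `=20` and the `=` at line ends shown in the question. Those look like quoted-printable transfer encoding (an assumption based on the samples: `=20` would be an encoded space, `=3D` an encoded `=`, and a trailing `=` a soft line break). Rather than replacing `=` by hand, it is probably easier to let the `email` package decode the body before parsing. A minimal Python 3 sketch; `extract_plain_text` and the sample message are made up for illustration:

```python
import email


def extract_plain_text(message):
    """Return the decoded text/plain body of an email.message.Message.

    Hypothetical helper: it takes the first text/plain part it finds and
    returns an empty string if there is none.
    """
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            # decode=True undoes the Content-Transfer-Encoding
            # (quoted-printable or base64) and returns bytes.
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return ""


# Made-up message that imitates the encoding seen in the question.
raw = (
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "Name_OF_=\n"
    "Person: Stefan\n"
    "OFFICIAL_POSTAL_ADDRESS: =20\n"
    "London, testarossa street 41\n"
)
text = extract_plain_text(email.message_from_string(raw))
print(text)
```

Messages yielded by `mailbox.mbox` are `email.message.Message` subclasses, so the same helper should work inside the loop from the question, and its result can then be handed to `parse_funky_text`.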
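Answer 1 (score: -1)
You indicated in the comments that your input string should be relatively consistent. If that is the case, and you want to cope with that string being split across multiple lines, the easiest approach is to replace \n with a space and then parse a single string.
I deliberately limited my answer to string methods instead of inventing a huge function to do this. Reasons: 1) your process is already complicated enough, and 2) your question really boils down to how to handle string data that spans multiple lines. If that is the case, and the pattern is consistent, this will do the job: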
content = content.replace('\n', ' ')
You can then split on each boundary between the consistently structured headers.
content = content.split("Name_OF_Person:")[1] # keep everything after "Name_OF_Person:"
person = content.split("Adress_HOME:")[0] # take content before "Adress_HOME:"
content = content.split("Adress_HOME:")[1] # keep everything after "Adress_HOME:"
address = content.split("Company_NAME:")[0] # take content before "Company_NAME:"
company = content.split("Company_NAME:")[1] # take second element of the list (the remainder), which is the company
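With around 20 keywords, as mentioned in the question, repeating these splits for every key gets long. The same idea can be put in a loop using only string methods. This is a sketch under stated assumptions: `split_by_keys` is a hypothetical helper, each key is assumed to appear at most once, and each key is assumed to be followed directly by a colon.

```python
def split_by_keys(content, keys):
    """Map each key found in content to the text between it and the next key."""
    content = content.replace("\n", " ")  # same flattening step as above
    result = {}
    for key in keys:
        _, found, rest = content.partition(key + ":")
        if not found:
            continue  # this key is missing from the message
        # Cut the value at the nearest other key, if any follows.
        end = len(rest)
        for other in keys:
            pos = rest.find(other + ":")
            if other != key and pos != -1:
                end = min(end, pos)
        # Collapse runs of whitespace left over from the line breaks.
        result[key] = " ".join(rest[:end].split())
    return result


sample = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos"""
fields = split_by_keys(sample, ["Name_OF_Person", "Adress_HOME", "Company_NAME"])
print(fields)
```

Because the result is a dictionary keyed by the form's own field names, it can be mapped onto the CSV column names before being passed to `csv.DictWriter.writerow`.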
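In general, I would recommend regular expressions (https://docs.python.org/3.4/library/re.html). In the long run, if you need to do this kind of thing again, regex will pay dividends. To make a regex match "cut" across multiple lines, use the re.DOTALL option, which lets . match newlines (re.MULTILINE only changes how ^ and $ behave). So it could end up looking like re.search('Name_OF_Person:(.*)Adress_HOME:', html_reg_form, re.DOTALL)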
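The pattern above can be wrapped in a small helper. This is a sketch, not a definitive implementation: `extract_between` is a hypothetical name, and it assumes both keys are present and followed by a colon. It passes `re.DOTALL` so that `.` can match newlines, and uses a non-greedy `(.*?)` so the match stops at the first occurrence of the end key. The last line shows that the same kind of pattern with only `re.MULTILINE` finds nothing when the value contains line breaks.

```python
import re


def extract_between(text, start_key, end_key):
    """Return the text between 'start_key:' and 'end_key:', or None."""
    pattern = re.escape(start_key) + r":(.*?)" + re.escape(end_key) + r":"
    match = re.search(pattern, text, re.DOTALL)  # DOTALL: '.' also matches '\n'
    if match is None:
        return None
    # Join the captured lines into one value with single spaces.
    return " ".join(match.group(1).split())


sample = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos"""

address = extract_between(sample, "Adress_HOME", "Company_NAME")
print(address)

# Without DOTALL, '.' stops at the first newline, so this search fails.
no_match = re.search("Adress_HOME:(.*)Company_NAME:", sample, re.MULTILINE)
print(no_match)
```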