Question

Apart from some experience with Tkinter (a few GUI experiments), I am new to Python.

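I read an .mbox file and copy the text/plain part into a string. This text contains a registration form. So a Stefan, living in London, Maple Street, and working for the company "MultiVendor XXVideos", has registered by e-mail for a subscription.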

Name_OF_Person: Stefan
Adress_HOME: London, Maple
    Street
 45
Company_NAME: MultiVendor
XXVideos

I want to take this data and write a .csv row with the columns "Name", "Address", "Company", ...

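So far I have tried cutting and slicing the string. For debugging I use "print" (IDE = KATE / KDE + terminal ... :-D). The problem is that the data after a keyword spans multiple lines, but I only get the first line.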

How would you improve my code?

import mailbox
import csv
import email
from time import sleep
import string

fieldnames = ["ID", "Subject", "Name", "Adress", "Company"]
searchKeys = ['Name_OF_Person', 'Adress_HOME', 'Company_NAME']
mbox_file = "REG.mbox"
export_file_name = "test.csv"

if __name__ == "__main__":
    with open(export_file_name, "w") as csvfile:
        writer = csv.DictWriter(csvfile, dialect='excel', fieldnames=fieldnames)
        writer.writeheader()

        for message in mailbox.mbox(mbox_file):
            if message.is_multipart():
                content = '\n'.join(part.get_payload() for part in message.get_payload())
                content = content.split('<')[0]  # only want text/plain.. I'll split right before HTML starts
                #print content
            else:
                content = message.get_payload()
            idea = message['message-id']
            sub = message['subject']
            fr = message['from']
            date = message['date']
            writer.writerow ('ID':idea,......) # CSV writing will work fine

            for line in content.splitlines():
                line = line.strip()
                for pose in searchKeys:
                    if pose in line:
                        tmp = line.split(pose)
                        pmt = tmp[1].split(":")[1]
                        if next in line !=: 
                        print pose +"\t"+pmt
                    sleep(1)
csvfile.closed

Output:

OFFICIAL_POSTAL_ADDRESS  =20

The following lines are missing here.. From the file:

OFFICIAL_POSTAL_ADDRESS: =20
London, testarossa street 41

EDIT2:

@Yaniv Thank you. I am still trying to understand each step, but I wanted to leave a comment already. I like the idea of using a list / matrix / vector "key_value_pairs".

There are about 20 keywords in each e-mail. Also, my values are sometimes broken across lines by "=". What I have in mind is:

Search text for Keyword A, 
if true: 
 search text from Keyword A until keyword B 
 if true:
  copy text after A until B

Name_OF_=
Person: Stefan
Adress_
=HOME: London, Maple
Street
 45
Company_NAME: MultiVendor
XXVideos

Maybe the HTML from the EMAIL.mbox is easier to process?

<tr><td bgcolor=3D"#eeeeee"><font face=3D"Verdana" size=3D"1">
<strong>NAM=
 E_REGISTERING_PERSON</strong></font></td><td bgcolor=3D"#eeeeee"><font    
 fac=e=3D"Verdana" size=3D"1">Stefan&nbsp;</font></td></tr>

But the "=" is still there. Should I replace ["=", "="] with ""?

Answer 1

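I would go with a "routine" loop over the input lines that parses them while maintaining current_key and current_value variables, since the value for a given key in your data might be "annoying" and spread over multiple lines.

I have demonstrated this parsing approach in the code below, making some assumptions about your problem. For example, if an input line starts with whitespace, I assume it must be one of those "annoying" values (spread across multiple lines). Such lines are joined into a single value using a configurable string (the parameter join_lines_using_this). Another assumption is that you probably want to strip whitespace from keys and values.

Feel free to adapt the code to your own assumptions about the input, and raise exceptions whenever they don't hold!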

# Note the usage of .strip() in some places, to strip away whitespaces. I assumed you might want that.
def parse_funky_text(text, join_lines_using_this=" "):

    key_value_pairs = []

    current_key, current_value = None, ""
    for line in text.splitlines():
        line_split = line.split(':')
        if line.startswith(" ") or len(line_split) == 1:
            if current_key is None:
                raise ValueError("Failed to parse this line, not sure which key it belongs to: %s" % line)
            current_value += join_lines_using_this + line.strip()
        else:
            if current_key is not None:
                key_value_pairs.append((current_key, current_value))
                current_key, current_value = None, ""
            current_key = line_split[0].strip()
            # We've just found a new key, so here you might want to perform additional checks,
            # e.g. if current_key not in sharedKeys: raise ValueError("Encountered a weird key?! %s in line: %s" % (current_key, line))
            current_value = ':'.join(line_split[1:]).strip()

    # Don't forget the last parsed key, value
    if current_key is not None:
        key_value_pairs.append((current_key, current_value))

    return key_value_pairs

Usage example:

text = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
    Street
 45
Company_NAME: MultiVendor
XXVideos"""

parse_funky_text(text)

This outputs:

[('Name_OF_Person', 'Stefan'), ('Adress_HOME', 'London, Maple Street 45'), ('Company_NAME', 'MultiVendor XXVideos')]
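
One caveat for real mailbox data: the `=20` in the question's output and the lines ending in `=` look like quoted-printable transfer encoding, where `=20` is an encoded space, `=3D` is an encoded `=`, and a trailing `=` is a soft line break. A minimal sketch, assuming the payload really is quoted-printable, decodes it with the standard-library `quopri` module before any parsing. The sample bytes and the helper name `decode_qp` are made up for illustration.

```python
import quopri


def decode_qp(raw_bytes, charset="utf-8"):
    """Decode quoted-printable bytes into text (hypothetical helper)."""
    # quopri.decodestring joins soft line breaks ("=" at the end of a line)
    # and turns escapes such as =20 or =3D back into single characters.
    return quopri.decodestring(raw_bytes).decode(charset, errors="replace")


# Made-up sample imitating the fragments shown in the question.
raw = (
    b"Name_OF_=\n"
    b"Person: Stefan\n"
    b"OFFICIAL_POSTAL_ADDRESS: =20\n"
    b"London, testarossa street 41\n"
)
text = decode_qp(raw)
print(text)
```

After decoding, the address line contains no colon, so the parser above would treat it as a continuation of the OFFICIAL_POSTAL_ADDRESS value. When working with the `email` package, `part.get_payload(decode=True)` applies the same decoding according to the part's Content-Transfer-Encoding header and returns bytes.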

Answer 2

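You indicated in the comments that your content input strings should be relatively consistent. If that is the case, and you want to handle values that are split across multiple lines, the easiest approach is to replace \n with a space and then parse the single string.

I deliberately limited my answer to string methods rather than inventing a huge function to do this. Reasons: 1) your process is already complex enough, and 2) your question really boils down to how to handle string data that spans multiple lines. If that is the case, and the pattern is consistent, then this will do the job: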

content = content.replace('\n', ' ')

Then you can split at each boundary formed by the consistently structured headers.

content = content.split("Name_OF_Person:")[1]  # keep everything after "Name_OF_Person:"
person = content.split("Adress_HOME:")[0]  # take content before "Adress_HOME:"
content = content.split("Adress_HOME:")[1]  # keep everything after "Adress_HOME:"
address = content.split("Company_NAME:")[0]  # take content before "Company_NAME:"
company = content.split("Company_NAME:")[1]  # the remainder after "Company_NAME:" is the company
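
With about 20 keywords, chaining split calls by hand gets repetitive. The same idea can be sketched as a loop, assuming every keyword is followed by a colon and the keywords appear in a known, fixed order. The names `extract_fields` and `keys` are hypothetical, chosen for illustration only.

```python
def extract_fields(content, keys):
    """Return {key: text between this key and the next one} (hypothetical helper)."""
    # Flatten the text onto one line; unlike a plain replace('\n', ' '),
    # this also collapses runs of spaces left over from indentation.
    flat = " ".join(content.split())
    fields = {}
    for i, key in enumerate(keys):
        marker = key + ":"
        start = flat.find(marker)
        if start == -1:
            continue  # keyword missing from this e-mail
        start += len(marker)
        # The value ends where the next keyword that is present begins.
        end = len(flat)
        for later in keys[i + 1:]:
            pos = flat.find(later + ":", start)
            if pos != -1:
                end = pos
                break
        fields[key] = flat[start:end].strip()
    return fields


text = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
    Street
 45
Company_NAME: MultiVendor
XXVideos"""

result = extract_fields(text, ["Name_OF_Person", "Adress_HOME", "Company_NAME"])
print(result)
```

The resulting dictionary can be remapped to the CSV column names and passed to a csv.DictWriter row. If a keyword can also occur inside a value, this simple search will cut the value short, so it only suits forms with a predictable layout.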

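In general, I would recommend regular expressions (https://docs.python.org/3.4/library/re.html). In the long run, if you need to do this sort of thing again, regular expressions will pay dividends on the time you spend munging data. To make a regular expression "cut" across multiple lines, you can use the re.DOTALL option, which lets . match newlines as well (re.MULTILINE only changes how ^ and $ match). So it might end up looking like re.search('Name_OF_Person:(.*)Adress_HOME:', html_reg_form, re.DOTALL)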
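
A runnable sketch of the regular-expression approach follows, with one caveat about flags: in Python's `re` module, `re.MULTILINE` only changes where `^` and `$` match, while `re.DOTALL` is the flag that lets `.` match newline characters. The sample text is the form from the question, and the non-greedy `(.*?)` is an assumption made here so that the match stops at the first following keyword.

```python
import re

text = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
    Street
 45
Company_NAME: MultiVendor
XXVideos"""

# re.DOTALL lets "." match newline characters, so a value may span lines.
match = re.search(r"Adress_HOME:(.*?)Company_NAME:", text, re.DOTALL)
if match:
    # Collapse the line breaks and indentation inside the captured value.
    address = " ".join(match.group(1).split())
    print(address)
```

The same pattern shape works for any pair of neighbouring keywords; the last keyword needs a pattern that runs to the end of the text instead of to a following keyword.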
