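Cutting a string that spans multiple lines

Time: 2018-02-20 16:07:29

Tags: python string csv cut

Apart from some experience with tkinter (a few GUI experiments), I am new to Python.

I read an .mbox file and copy the text/plain part into a string. This text contains a registration form. So Stefan, who lives on Maple Street in London and works for the company "MultiVendor XXVideos", has registered for a subscription by email.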

Name_OF_Person: Stefan
Adress_HOME: London, Maple
    Street
 45
Company_NAME: MultiVendor
XXVideos

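I want to take this data and add it as a row to a .csv file with the columns "Name", "Address", "Company", ...

Right now I am trying to cut the string by slicing. For debugging I use "print" (IDE = KATE / KDE + terminal ... :-D). The problem is that the data after a keyword spans multiple lines, but I only get the first line.

How would you improve my code?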

import mailbox
import csv
import email
from time import sleep
import string
fieldnames = ["ID","Subject","Name", "Adress", "Company"]
searchKeys = [ 'Name_OF_Person','Adress_HOME','Company_NAME']
mbox_file = "REG.mbox"
export_file_name = "test.csv"

if __name__ == "__main__":
    with open(export_file_name, "w") as csvfile:
        writer = csv.DictWriter(csvfile, dialect='excel', fieldnames=fieldnames)
        writer.writeheader()

        for message in mailbox.mbox(mbox_file):
            if message.is_multipart():
                content = '\n'.join(part.get_payload() for part in message.get_payload())
                content = content.split('<')[0]  # only want text/plain, so I split right before the HTML starts
                # print(content)
            else:
                content = message.get_payload()
            idea = message['message-id']
            sub = message['subject']
            fr = message['from']
            date = message['date']
            # writer.writerow({'ID': idea, ......})  # CSV writing works fine

            for line in content.splitlines():
                line = line.strip()
                for pose in searchKeys:
                    if pose in line:
                        tmp = line.split(pose)
                        pmt = tmp[1].split(":")[1]
                        # if next in line !=:  (incomplete line as posted)
                        print(pose + "\t" + pmt)
                    sleep(1)
    # csvfile is closed automatically when the "with" block ends

Output:

OFFICIAL_POSTAL_ADDRESS  =20

Lines are missing here. From the file:

OFFICIAL_POSTAL_ADDRESS: =20
London, testarossa street 41

EDIT2:

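@Yaniv Thank you. I am still trying to understand every step, but I just wanted to leave a comment. I like the idea of using a list / matrix / vector "key_value_pairs".

The emails contain about 20 keywords. Also, my values are sometimes broken across lines by a "=". What I have in mind is: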

Search text for Keyword A, 
if true: 
 search text from Keyword A until keyword B 
 if true:
  copy text after A until B

Name_OF_=
Person: Stefan
Adress_
=HOME: London, Maple
Street
 45
Company_NAME: MultiVendor
XXVideos
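
A minimal sketch of the keyword-to-keyword idea above, assuming the "=" line breaks have already been removed from the text and that every keyword appears at most once. The function name extract_between_keys and the variable search_keys are made up for this illustration.

```python
def extract_between_keys(text, keys):
    """Return {key: value}, where each value runs from its key to the next key found."""
    # Locate every key that occurs in the text, remembering where it starts.
    found = sorted((text.find(key), key) for key in keys if key in text)
    result = {}
    for index, (start, key) in enumerate(found):
        value_start = start + len(key)
        # The value ends where the next key begins, or at the end of the text.
        value_end = found[index + 1][0] if index + 1 < len(found) else len(text)
        value = text[value_start:value_end].lstrip(":")
        # Collapse line breaks and repeated whitespace into single spaces.
        result[key] = " ".join(value.split())
    return result


text = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
    Street
 45
Company_NAME: MultiVendor
XXVideos"""

search_keys = ["Name_OF_Person", "Adress_HOME", "Company_NAME"]
fields = extract_between_keys(text, search_keys)
print(fields)
```

Because the value is simply everything up to the next keyword, it does not matter how many lines a value is spread over.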

Maybe the HTML from EMAIL.mbox is easier to handle?

<tr><td bgcolor=3D"#eeeeee"><font face=3D"Verdana" size=3D"1">
<strong>NAM=
 E_REGISTERING_PERSON</strong></font></td><td bgcolor=3D"#eeeeee"><font    
 fac=e=3D"Verdana" size=3D"1">Stefan&nbsp;</font></td></tr>

But the "=" is still there. Should I replace ["=", "="] with ""?
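
The stray "=" characters (a "=" at the end of a line, "=20", "=3D") look like quoted-printable transfer encoding rather than part of the data. If so, decoding the payload is likely safer than replacing "=" by hand. The sketch below assumes the message declares Content-Transfer-Encoding: quoted-printable; raw_message is a made-up sample, not taken from the real mailbox. For a bare string without headers, quopri.decodestring from the standard library does the same decoding.

```python
import email

# Made-up sample message; the body uses quoted-printable soft line breaks ("=" at
# the end of a line) and "=20" for an encoded space.
raw_message = (
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "Name_OF_=\n"
    "Person: Stefan\n"
    "OFFICIAL_POSTAL_ADDRESS: =20\n"
    "London, testarossa street 41\n"
)

message = email.message_from_string(raw_message)
# decode=True undoes the transfer encoding and returns bytes.
decoded = message.get_payload(decode=True).decode("utf-8")
print(decoded)
```

In the loop over the mailbox, the same call would be part.get_payload(decode=True) on each non-multipart part instead of part.get_payload().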

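2 answers:

Answer 0 (score: 1)

I would go for a "routine" parsing loop over the input lines, maintaining current_key and current_value variables, since a value in your data may be "annoying" and spread over multiple lines.

I have shown this parsing approach in the code below, with some assumptions about your problem. For example, if an input line starts with a space, I assume it must belong to such an "annoying" value (one spread over multiple lines). These lines are joined into a single value using a configurable string (the parameter join_lines_using_this). Another assumption is that you probably want to strip whitespace from keys and values.

Feel free to adapt the code to your own assumptions about the input, and raise exceptions whenever they do not hold!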

# Note the usage of .strip() in some places, to strip away whitespaces. I assumed you might want that.
def parse_funky_text(text, join_lines_using_this=" "):

    key_value_pairs = []

    current_key, current_value = None, ""
    for line in text.splitlines():
        line_split = line.split(':')
        if line.startswith(" ") or len(line_split) == 1:
            if current_key is None:
                raise ValueError("Failed to parse this line, not sure which key it belongs to: %s" % line)
            current_value += join_lines_using_this + line.strip()
        else:
            if current_key is not None:
                key_value_pairs.append((current_key, current_value))
                current_key, current_value = None, ""
            current_key = line_split[0].strip()
            # We've just found a new key, so here you might want to perform additional checks,
            # e.g. if current_key not in searchKeys: raise ValueError("Encountered a weird key?! %s in line: %s" % (current_key, line))
            current_value = ':'.join(line_split[1:]).strip()

    # Don't forget the last parsed key, value
    if current_key is not None:
        key_value_pairs.append((current_key, current_value))

    return key_value_pairs

Usage example:

text = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
    Street
 45
Company_NAME: MultiVendor
XXVideos"""

parse_funky_text(text)

will output:

[('Name_OF_Person', 'Stefan'), ('Adress_HOME', 'London, Maple Street 45'), ('Company_NAME', 'MultiVendor XXVideos')]
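
To get from these pairs to the .csv row asked about in the question, something like the following might work. It is only a sketch: the mapping key_to_column is hypothetical, and io.StringIO stands in for the real output file.

```python
import csv
import io

fieldnames = ["Name", "Adress", "Company"]
# Hypothetical mapping from the form's keywords to the CSV column names.
key_to_column = {
    "Name_OF_Person": "Name",
    "Adress_HOME": "Adress",
    "Company_NAME": "Company",
}

key_value_pairs = [
    ("Name_OF_Person", "Stefan"),
    ("Adress_HOME", "London, Maple Street 45"),
    ("Company_NAME", "MultiVendor XXVideos"),
]

# Keep only the keys that have a column, renamed to the column name.
row = {key_to_column[key]: value for key, value in key_value_pairs if key in key_to_column}

# io.StringIO stands in for the real output file here.
output = io.StringIO()
writer = csv.DictWriter(output, dialect="excel", fieldnames=fieldnames)
writer.writeheader()
writer.writerow(row)
csv_text = output.getvalue()
print(csv_text)
```

When writing to a real file in Python 3, opening it with open(export_file_name, "w", newline="") avoids blank lines between rows on some platforms.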

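Answer 1 (score: -1)

You indicated in the comments that your content input string should be relatively consistent. If that is the case, and you want to handle a string that is split across multiple lines, the simplest way is to replace \n with a space and then parse the single string.

I deliberately limited my answer to string methods instead of inventing a huge function to do this. Reasons: 1) your process is already complicated enough, and 2) your question really boils down to how to handle string data across multiple lines. If that is the case, and the pattern is consistent, then this will do the job: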

content = content.replace('\n', ' ')

Then you can split at each boundary given by the consistently structured headers.

content = content.split("Name_OF_Person:")[1]  # keep everything after "Name_OF_Person:"
person = content.split("Adress_HOME:")[0]  # content before "Adress_HOME:"
content = content.split("Adress_HOME:")[1]  # keep everything after "Adress_HOME:"
address = content.split("Company_NAME:")[0]  # content before "Company_NAME:"
company = content.split("Company_NAME:")[1]  # the remainder, which is the company

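In general, I would recommend regular expressions (https://docs.python.org/3.4/library/re.html). In the long run, if you need to do this kind of thing again, regular expressions will pay dividends in time saved. To make a regular expression "cut" across multiple lines, you can use the re.DOTALL option, which lets . match newlines (re.MULTILINE only changes how ^ and $ behave). So it might end up looking like re.search('Name_OF_Person:(.*)Adress_HOME:', html_reg_form, re.DOTALL)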
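
A small sketch of the regular-expression route, assuming the three keywords always appear in this order. Note that the flag that lets . match across line breaks is re.DOTALL. The sample text in html_reg_form is the one from the question.

```python
import re

html_reg_form = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
    Street
 45
Company_NAME: MultiVendor
XXVideos"""

# re.DOTALL lets "." match newlines, so each group can span several lines.
# The non-greedy (.*?) stops at the first occurrence of the next keyword.
pattern = re.compile(
    r"Name_OF_Person:(.*?)Adress_HOME:(.*?)Company_NAME:(.*)",
    re.DOTALL,
)

match = pattern.search(html_reg_form)
if match:
    # Collapse line breaks and repeated whitespace inside each captured value.
    person, address, company = (" ".join(group.split()) for group in match.groups())
else:
    person = address = company = None

print(person, address, company, sep=" | ")
```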