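Parsing a text file

Time: 2015-08-10 03:21:13

Tags: python parsing

I have built a contact form that sends me an email for every user registration. My question is mostly about parsing some text data into CSV format. I received several users' details in my mailbox and copied them into a text file. The data looks like this.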

Name: testuser2
Email: testuser2@gmail.com
Cluster Name: o  b
Contact No.: 12346971239
Coming: Yes

Name: testuser3
Email: testuser3@gmail.com
Cluster Name: Mediternea
Contact No.: 9121319107
Coming: Yes

Name: testuser4
Email: tuser4@yahoo.com
Cluster Name: Mediterranea
Contact No.: 7892174896
Coming: Yes

Name: tuser5
Email: tuserner5@gmail.com
Cluster Name: River Retreat A
Contact No.: 7583450912
Coming: Yes
Members Participating: 2

Name: Test User
Email: testuser@yahoo.co.in
Cluster Name: RD
Contact No.: 09833123445
Coming: Yes
Members Participating: 2

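As you can see, the data contains some common fields and some fields that are not present in every record. I am looking for a solution or suggestions on how to parse this data so that, for example, all the name information is collected under a "Name" column, and likewise for the other fields. For the data under the "Members Participating" heading, I only need to pick out the number and add it to the Excel sheet under the same heading; if a user did not provide that information, the cell can be left blank.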

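4 answers:

Answer 0 (score: 1)

The following program may meet your requirements. The overall strategy:

  • First, read all the email files and parse the data "by hand", then
  • Second, write the data to a CSV file with csv.DictWriter.writerows().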

import sys
import csv

# Usage:
# python cfg2csv.py input1.cfg input2.cfg ...
# The data is combined and written to 'output.csv'

def parse_file(data):
    total_result = []
    single_result = []
    for line in data:
        line = line.strip()
        if line:
            single_result.append([item.strip() for item in line.split(':', 1)])
        else:
            if single_result:
                total_result.append(dict(single_result))
            single_result = []
    if single_result:
        total_result.append(dict(single_result))
    return total_result

def read_file(filename):
    with open(filename) as fp:
        return parse_file(fp)

# First parse the data:
data = sum((read_file(filename) for filename in sys.argv[1:]), [])
keys = set().union(*data)

# Next write the data to a CSV file
with open('output.csv', 'w') as fp:
    writer = csv.DictWriter(fp, sorted(keys))
    writer.writeheader()
    writer.writerows(data)
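
One thing to be aware of with `sorted(keys)`: the columns come out in alphabetical order, not in the order they appear in the input. A minimal sketch of that effect, assuming two hand-written sample records (`records` is a hypothetical stand-in for the parsed `data` list):

```python
# Hypothetical sample records standing in for the parsed `data` list.
records = [
    {"Name": "testuser2", "Email": "testuser2@gmail.com", "Coming": "Yes"},
    {"Name": "tuser5", "Coming": "Yes", "Members Participating": "2"},
]

# Union of all keys across the records, as in the program above.
keys = set().union(*records)

# sorted() orders the column names alphabetically.
columns = sorted(keys)
print(columns)  # ['Coming', 'Email', 'Members Participating', 'Name']
```

If the original field order matters, passing an explicit list of field names to csv.DictWriter instead of `sorted(keys)` is one way to keep it.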

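Answer 1 (score: 1)

You can use the blank lines between records to mark the end of a record. Process the input file line by line and build a list of dictionaries, then write the dictionaries to a CSV file.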

from csv import DictWriter
from collections import OrderedDict

with open('input') as infile:
    registrations = []
    fields = OrderedDict()
    d = {}
    for line in infile:
        line = line.strip()
        if line:
            key, value = [s.strip() for s in line.split(':', 1)]
            d[key] = value
            fields[key] = None
        else:
            if d:
                registrations.append(d)
                d = {}
    else:
        if d:    # handle EOF
            registrations.append(d)


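# Uncomment the next line (and remove the one after it) to force a fixed column order:
# fieldnames = ['Name', 'Email', 'Cluster Name', 'Contact No.', 'Coming', 'Members Participating']
fieldnames = list(fields)   # unique keys in order of first appearance

with open('registrations.csv', 'w') as outfile:
    writer = DictWriter(outfile, fieldnames=fieldnames)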
    writer.writeheader()
    writer.writerows(registrations)

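This code tries to collect the field names automatically and uses the order in which each unique key first appears in the input. If you need a specific field order in the output, you can pin it down by uncommenting the corresponding line.

Running this code on your sample input produces the following result: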

Name,Email,Cluster Name,Contact No.,Coming,Members Participating
testuser2,testuser2@gmail.com,o  b,12346971239,Yes,
testuser3,testuser3@gmail.com,Mediternea,9121319107,Yes,
testuser4,tuser4@yahoo.com,Mediterranea,7892174896,Yes,
tuser5,tuserner5@gmail.com,River Retreat A,7583450912,Yes,2
Test User,testuser@yahoo.co.in,RD,09833123445,Yes,2
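
If you do pin the field order to a fixed list, a record containing a key outside that list makes csv.DictWriter raise ValueError by default. A minimal sketch of one way to handle that, assuming an in-memory buffer in place of the output file and a hypothetical extra key "Note" that does not occur in the sample data:

```python
import csv
import io

# Fixed column order, taken from the commented-out line in the code above.
FIELDNAMES = ["Name", "Email", "Cluster Name", "Contact No.", "Coming", "Members Participating"]

# Hypothetical record with one unexpected key ("Note"), for illustration only.
registrations = [
    {"Name": "tuser5", "Email": "tuserner5@gmail.com", "Cluster Name": "River Retreat A",
     "Contact No.": "7583450912", "Coming": "Yes", "Members Participating": "2", "Note": "x"},
]

buffer = io.StringIO()  # in-memory stand-in for the output file
# extrasaction="ignore" drops keys that are not in FIELDNAMES instead of raising ValueError.
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, extrasaction="ignore")
writer.writeheader()
writer.writerows(registrations)
csv_text = buffer.getvalue()
print(csv_text)
```

Whether to ignore unknown keys or let the error surface depends on how much you trust the input; the default of raising is the safer choice when a misspelled field name should not go unnoticed.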

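Answer 2 (score: 1)

Let's break the problem down into smaller subproblems:

  1. Split the large block of text into individual registrations
  2. Convert each registration into a dictionary
  3. Write the list of dictionaries to CSV

First, let's split the block of registration data into separate elements: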

    DATA = '''
    Name: testuser2
    Email: testuser2@gmail.com
    Cluster Name: o  b
    Contact No.: 12346971239
    Coming: Yes
    
    Name: testuser3
    Email: testuser3@gmail.com
    Cluster Name: Mediternea
    Contact No.: 9121319107
    Coming: Yes
    '''
    
    def parse_registrations(data):
        data = data.strip()
        return data.split('\n\n')
    

This function gives us a list of the individual registrations:

    >>> regs = parse_registrations(DATA)
    >>> regs
    ['Name: testuser2\nEmail: testuser2@gmail.com\nCluster Name: o  b\nContact No.: 12346971239\nComing: Yes', 'Name: testuser3\nEmail: testuser3@gmail.com\nCluster Name: Mediternea\nContact No.: 9121319107\nComing: Yes']
    >>> regs[0]
    'Name: testuser2\nEmail: testuser2@gmail.com\nCluster Name: o  b\nContact No.: 12346971239\nComing: Yes'
    >>> regs[1]
    'Name: testuser3\nEmail: testuser3@gmail.com\nCluster Name: Mediternea\nContact No.: 9121319107\nComing: Yes'
    

Next, we can turn each of these substrings into a list of (key, value) pairs:

    >>> [field.split(': ', 1) for field in regs[0].split('\n')]
    [['Name', 'testuser2'], ['Email', 'testuser2@gmail.com'], ['Cluster Name', 'o  b'], ['Contact No.', '12346971239'], ['Coming', 'Yes']]
    

The dict() function can turn a list of (key, value) pairs into a dictionary:

    >>> dict(field.split(': ', 1) for field in regs[0].split('\n'))
    {'Coming': 'Yes', 'Cluster Name': 'o  b', 'Name': 'testuser2', 'Contact No.': '12346971239', 'Email': 'testuser2@gmail.com'}
    

We can pass these dictionaries to csv.DictWriter to write the records as CSV, with a default value filled in for any missing fields.

    >>> w = csv.DictWriter(open("/tmp/foo.csv", "w"), fieldnames=["Name", "Email", "Cluster Name", "Contact No.", "Coming", "Members Participating"])
    >>> w.writeheader()
    >>> w.writerow({'Name': 'Steve'})
    12
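
The default used for missing fields is an empty string, and it can be changed with the `restval` parameter of csv.DictWriter. A minimal sketch, assuming a shortened column list and an in-memory buffer in place of a file (`to_csv` is a hypothetical helper, not part of the csv module):

```python
import csv
import io

# Shortened column list, for illustration only.
COLUMNS = ["Name", "Email", "Members Participating"]

def to_csv(rows, restval=""):
    """Write rows to an in-memory CSV and return the resulting text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval=restval)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

rows = [{"Name": "Steve"}]
default_text = to_csv(rows)        # missing fields become empty strings
marked_text = to_csv(rows, "N/A")  # missing fields become "N/A"

print(default_text.splitlines()[1])  # Steve,,
print(marked_text.splitlines()[1])   # Steve,N/A,N/A
```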
    

Now, let's put it all together!

    import csv
    
    DATA = '''
    Name: testuser2
    Email: testuser2@gmail.com
    Cluster Name: o  b
    Contact No.: 12346971239
    Coming: Yes
    
    Name: tuser5
    Email: tuserner5@gmail.com
    Cluster Name: River Retreat A
    Contact No.: 7583450912
    Coming: Yes
    Members Participating: 2
    '''
    
    COLUMNS = ["Name", "Email", "Cluster Name", "Contact No.", "Coming", "Members Participating"]
    
    def parse_registration(reg):
        return dict(field.split(': ', 1) for field in reg.split('\n'))
    
    def parse_registrations(data):
        data = data.strip()
        regs = data.split('\n\n')
        return [parse_registration(r) for r in regs]
    
    def write_csv(data, filename):
        regs = parse_registrations(data)
        with open(filename, 'w') as f:
            writer = csv.DictWriter(f, fieldnames=COLUMNS)
            writer.writeheader()
            writer.writerows(regs)
    
    if __name__ == '__main__':
        write_csv(DATA, "/tmp/test.csv")
    

Output:

    $ python3 write_csv.py
    
    $ cat /tmp/test.csv
    Name,Email,Cluster Name,Contact No.,Coming,Members Participating
    testuser2,testuser2@gmail.com,o  b,12346971239,Yes,
    tuser5,tuserner5@gmail.com,River Retreat A,7583450912,Yes,2
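
The question also asks to keep only the number from "Members Participating" and to leave the cell blank when the field is missing. In the sample data the value is already a bare number, so this step may not be needed. A minimal sketch for the case where extra text sneaks in, assuming the first run of digits is the one wanted (`members_count` is a hypothetical helper):

```python
import re

def members_count(record):
    """Return the first run of digits from 'Members Participating', or '' if there is none."""
    value = record.get("Members Participating", "")
    match = re.search(r"\d+", value)
    return match.group() if match else ""

with_count = {"Name": "tuser5", "Members Participating": "2"}
without_count = {"Name": "testuser2"}

print(members_count(with_count))     # 2
print(members_count(without_count))  # prints an empty line
```

The result could be written back into each dictionary before it is handed to csv.DictWriter.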
    

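Answer 3 (score: 0)

The following automatically converts the input text file into a CSV file. The headers are generated automatically from the longest entry.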

import csv, re

# newline="" lets the csv module control line endings (Python 3)
with open("input.txt", "r") as f_input, open("output.csv", "w", newline="") as f_output:
    csv_output = csv.writer(f_output)
    # strip() drops trailing newlines so the last entry does not end in an empty line
    entries = re.findall(r"^(Name: .*?)(?:\n\n|\Z)", f_input.read().strip(), re.M | re.S)

    # Determine the entry with the most fields for the CSV headers
    headings = []
    for entry in entries:
        headings = max(headings, [line.split(":")[0] for line in entry.split("\n")], key=len)
    csv_output.writerow(headings)

    # Write the entries (split on the first colon only, so values may contain colons)
    for entry in entries:
        csv_output.writerow([line.split(":", 1)[1].strip() for line in entry.split("\n")])

This produces a CSV text file that can be opened in Excel, as shown below:

Name,Email,Cluster Name,Contact No.,Coming,Members Participating
testuser2,testuser2@gmail.com,o  b,12346971239,Yes
testuser3,testuser3@gmail.com,Mediternea,9121319107,Yes
testuser4,tuser4@yahoo.com,Mediterranea,7892174896,Yes
tuser5,tuserner5@gmail.com,River Retreat A,7583450912,Yes,2
Test User,testuser@yahoo.co.in,RD,09833123445,Yes,2
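
Note that the rows without "Members Participating" are one field shorter than the header. Excel tolerates this, and so does csv.DictReader if the file is read back in Python. A minimal sketch, assuming two rows copied from the output above and an in-memory string in place of the file:

```python
import csv
import io

# Two rows copied from the output above; the first one is one field short.
csv_text = (
    "Name,Email,Cluster Name,Contact No.,Coming,Members Participating\n"
    "testuser2,testuser2@gmail.com,o  b,12346971239,Yes\n"
    "tuser5,tuserner5@gmail.com,River Retreat A,7583450912,Yes,2\n"
)

# restval="" fills the fields missing from short rows with an empty string.
reader = csv.DictReader(io.StringIO(csv_text), restval="")
rows = list(reader)

print(repr(rows[0]["Members Participating"]))  # ''
print(repr(rows[1]["Members Participating"]))  # '2'
```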