Parsing CSV data based on header fields using Pyparsing

Time: 2014-05-15 07:29:05

Tags: python parsing csv dynamic pyparsing

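I have been slowly learning PyParsing and find it a very good tool with a lot of potential, but I am struggling because of the lack of detailed documentation. As a result, I have run into a problem.

My goal is to parse CSV files that contain sets of columns forming groups of data. These groups are important later, when interpreting the data in post-processing. In addition, the CSV files have optional columns, which is why I really like pyparsing for its flexibility.

I have successfully created a parser that validates and correctly parses the header of the CSV file. However, I see two options for correctly handling the data rows.

1) I could create another parser for the data based on the ParseResults of the header. That way the data parser knows which columns it should expect.

OR

2) Read the data rows in as arrays. But then how do I retrieve the column number (not the character position) of each header field from the header parser?

Below is a toy example to illustrate what I want to achieve.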

csv_header_1='''FirstName Surname Address Notes PurchaseOrder OrderDate'''

csv_data_1='''"Bob" "Smith" "123 Lucky Street" "Bad customer" "123ABC", 2013/10/20
"Zoe" "Jackson" "5 Mountain View Street" "Good customer" "abc211" 2014/01/01'''.splitlines()


csv_header_2='''FirstName Surname Address PhoneHome PhoneMobile PurchaseOrder OrderDate Total'''

csv_data_2='''"Bob" "Smith" "123 Lucky Street" "12345678" "1234567890" "123ABC" 2013/10/20, $100
"Zoe" "Jackson" "5 Mountain View Street" "87654321" "0987654321" "abc211" 2014/01/01 $1000'''.splitlines()

# Pyparsing header parser:

from pyparsing import Literal, Group, Optional, ParseException

print 'Create pyparsing Elements.'
firstname=Literal('FirstName').setResultsName('Firstname')
surname=Literal('Surname').setResultsName('Surname')
address=Literal('Address').setParseAction( lambda tokens: " ".join(tokens)).setResultsName('Address')
notes=Literal('Notes').setResultsName('Notes')
phone_home= Literal('PhoneHome').setResultsName('Home')
phone_mobile= Literal('PhoneMobile').setResultsName('Mobile')
customer=Group(firstname + surname + address + Optional(notes) + Optional(phone_home + phone_mobile) ).setResultsName('Customer')

purchase_order= Literal('PurchaseOrder').setResultsName('Purchase_order')
order_date= Literal('OrderDate').setResultsName('Order_date')
total= Literal('Total').setResultsName('Total')
order = Group(purchase_order + order_date + Optional(total) ).setResultsName('Order')


header=Group( customer + order ).setResultsName('Header')

print 'Parse CSV header 1.'

try:
    parsed_header = header.parseString(csv_header_1)
except ParseException, err:
    print err.line
    print " "*(err.column-1) + "^"
    print err


print 'CSV header 1 dump: ', parsed_header.dump()

try:
    parsed_header = header.parseString(csv_header_2)
except ParseException, err:
    print err.line
    print " "*(err.column-1) + "^"
    print err


print 'CSV header 2 dump: ', parsed_header.dump()

Output:

Create pyparsing Elements.
Parse CSV header 1.
CSV header 1 dump:  [[['FirstName', 'Surname', 'Address', 'Notes'], ['PurchaseOrder', 'OrderDate']]]
- Header: [['FirstName', 'Surname', 'Address', 'Notes'], ['PurchaseOrder', 'OrderDate']]
  - Customer: ['FirstName', 'Surname', 'Address', 'Notes']
    - Address: Address
    - Firstname: FirstName
    - Notes: Notes
    - Surname: Surname
  - Order: ['PurchaseOrder', 'OrderDate']
    - Order_date: OrderDate
    - Purchase_order: PurchaseOrder
CSV header 2 dump:  [[['FirstName', 'Surname', 'Address', 'PhoneHome', 'PhoneMobile'], ['PurchaseOrder', 'OrderDate', 'Total']]]
- Header: [['FirstName', 'Surname', 'Address', 'PhoneHome', 'PhoneMobile'], ['PurchaseOrder', 'OrderDate', 'Total']]
  - Customer: ['FirstName', 'Surname', 'Address', 'PhoneHome', 'PhoneMobile']
    - Address: Address
    - Firstname: FirstName
    - Home: PhoneHome
    - Mobile: PhoneMobile
    - Surname: Surname
  - Order: ['PurchaseOrder', 'OrderDate', 'Total']
    - Order_date: OrderDate
    - Purchase_order: PurchaseOrder
    - Total: Total

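The header parser works well, but how do I parse the data rows correctly?

I know I could write a data parser based on the data type of each field, but that will not work because the optional columns do not necessarily have unique data types. I need to use the header to determine how many columns are in each group and the data type of each column.

I can manually create the parser rules below, but I need to create the "customer" and "order" ParserElements dynamically so that the row data is parsed correctly. (Note that the snippet below does not handle the double quotes.)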

from pyparsing import Word, OneOrMore, Combine, Group, Suppress, alphas, alphanums, nums

firstname=Word(alphas).setResultsName('Firstname')
surname=Word(alphas).setResultsName('Surname')
address=OneOrMore(Word(alphas)).setParseAction( lambda tokens: " ".join(tokens)).setResultsName('Address')
phone_home= Word(nums).setResultsName('Home')
phone_mobile= Word(nums).setResultsName('Mobile')
# customer=Group(firstname + surname + address + Optional(phone_home) + Optional(phone_mobile) ).setResultsName('Customer')

# Purchase order numbers such as "123ABC" mix digits and letters.
purchase_order= Word(alphanums).setResultsName('Purchase_order')
# nums is a plain string of digits, so each date part must be wrapped in Word().
order_date= Combine(Word(nums) + "/" + Word(nums) + "/" + Word(nums)).setResultsName('Date')
total= Group( Suppress('$') + Word(nums) ).setResultsName('Total')
# order = Group(purchase_order + order_date + Optional(total) ).setResultsName('Order')
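
For illustration only, the header-driven idea can be sketched with the standard library instead of pyparsing. This is a rough sketch under stated assumptions, not a pyparsing solution: the names CONVERTERS, build_row_parser and parse_row are hypothetical, the per-field types are guesses based on the sample data, and the code is written for Python 3 (unlike the Python 2 listings in this post). The header decides both the column number of every field and which converter is applied to each value.

```python
import csv
from datetime import datetime

# Hypothetical per-field converters; any field not listed here stays a string.
CONVERTERS = {
    "OrderDate": lambda s: datetime.strptime(s, "%Y/%m/%d").date(),
    "Total": lambda s: int(s.lstrip("$")),
}


def build_row_parser(header_line):
    """Return (column numbers, row parsing function) derived from the header."""
    names = header_line.split()
    # 0-based column number of every header field (not a character offset).
    columns = {name: index for index, name in enumerate(names)}

    def parse_row(line):
        # csv.reader copes with the double-quoted, space-separated values.
        values = next(csv.reader([line], delimiter=" "))
        if len(values) != len(names):
            raise ValueError("expected %d columns, got %d" % (len(names), len(values)))
        row = {}
        for name, raw in zip(names, values):
            raw = raw.rstrip(",")  # the sample rows contain stray commas
            row[name] = CONVERTERS.get(name, str)(raw)
        return row

    return columns, parse_row


header = "FirstName Surname Address PhoneHome PhoneMobile PurchaseOrder OrderDate Total"
columns, parse_row = build_row_parser(header)
row = parse_row('"Bob" "Smith" "123 Lucky Street" "12345678" "1234567890" "123ABC" 2013/10/20, $100')
print(columns["PurchaseOrder"])
print(row["OrderDate"], row["Total"])
```

Because the parser is rebuilt from each header, a file without the Total column simply never calls the Total converter.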

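Thanks for your help; it is much appreciated.

Update

Below is the sample output I would like to get from a pyparsing parser for the data rows. The examples cover only a single data row from each of the CSV samples given above.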

CSV data 1 dump:  [[["Bob" "Smith" "123 Lucky Street" "Bad customer"], ["123ABC", 2013/10/20]]]
- Header: [["Bob" "Smith" "123 Lucky Street" "Bad customer"], ["123ABC", 2013/10/20]]
  - Customer: ["Bob" "Smith" "123 Lucky Street" "Bad customer"]
    - Address: "123 Lucky Street"
    - Firstname: "Bob"
    - Notes: "Bad customer"
    - Surname: "Smith"
  - Order: ["123ABC", 2013/10/20]
    - Order_date: 2013/10/20
    - Purchase_order: "123ABC"


CSV data 2 dump:  [[["Bob" "Smith" "123 Lucky Street" "12345678" "1234567890"], [ "123ABC" 2013/10/20, $100]]]
- Header: [["Bob" "Smith" "123 Lucky Street" "12345678" "1234567890"], [ "123ABC" 2013/10/20, $100]]
  - Customer: ["Bob" "Smith" "123 Lucky Street" "12345678" "1234567890"]
    - Address: "123 Lucky Street"
    - Firstname: "Bob"
    - Home: "12345678"
    - Mobile: "1234567890"
    - Surname: "Smith"
  - Order: [ "123ABC" 2013/10/20, $100]
    - Order_date: 2013/10/20
    - Purchase_order: "123ABC"
    - Total: $100

This is just an example; I am open to different approaches, as suggested by Jan and EOL.

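1 answer:

Answer 0: (score: 2)

CSV file processing

Check the documentation of the built-in csv module. There you will find DictReader, which lets you process CSV files that have a header row. It gives you an iterator that returns, for each record/line, a dictionary with one key per field name and the corresponding value.

Put this data into a file named "data.csv":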

name;surname
Jan;Vlcinsky
Pieter;Pan
Jane;Fonda

Then you can test it from the console:

>>> from csv import DictReader
>>> fname = "data.csv"
>>> f = open(fname)
>>> reader = DictReader(f, delimiter=";")
>>> for rec in reader:
...     print rec
...
{'surname': 'Vlcinsky', 'name': 'Jan'}
{'surname': 'Pan', 'name': 'Pieter'}
{'surname': 'Fonda', 'name': 'Jane'}

Using your data, and simulating an open file with StringIO:

from StringIO import StringIO
from csv import DictReader

data1 = """
FirstName Surname Address Notes PurchaseOrder OrderDate
"Bob" "Smith" "123 Lucky Street" "Bad customer" "123ABC", 2013/10/20
"Zoe" "Jackson" "5 Mountain View Street" "Good customer" "abc211" 2014/01/01
""".strip()


data2 = """
FirstName Surname Address PhoneHome PhoneMobile PurchaseOrder OrderDate Total
"Bob" "Smith" "123 Lucky Street" "12345678" "1234567890" "123ABC" 2013/10/20, $100
"Zoe" "Jackson" "5 Mountain View Street" "87654321" "0987654321" "abc211" 2014/01/01 $1000
""".strip()

buf1 = StringIO(data1)
buf2 = StringIO(data2)

reader = DictReader(buf1, delimiter=" ")
for rec in reader:
    print rec

print "---next one comes---"

reader = DictReader(buf2, delimiter=" ")
for rec in reader:
    print rec

This prints:

{'Surname': 'Smith', 'FirstName': 'Bob', 'Notes': 'Bad customer', 'PurchaseOrder': '123ABC,', 'Address': '123 Lucky Street', 'OrderDate': '2013/10/20'}
{'Surname': 'Jackson', 'FirstName': 'Zoe', 'Notes': 'Good customer', 'PurchaseOrder': 'abc211', 'Address': '5 Mountain View Street', 'OrderDate': '2014/01/01'}
---next one comes---
{'Surname': 'Smith', 'FirstName': 'Bob', 'PhoneMobile': '1234567890', 'PhoneHome': '12345678', 'PurchaseOrder': '123ABC', 'Address': '123 Lucky Street', 'Total': '$100', 'OrderDate': '2013/10/20,'}
{'Surname': 'Jackson', 'FirstName': 'Zoe', 'PhoneMobile': '0987654321', 'PhoneHome': '87654321', 'PurchaseOrder': 'abc211', 'Address': '5 Mountain View Street', 'Total': '$1000', 'OrderDate': '2014/01/01'}

With that, the parsing part is done; what remains is to create proper objects from the records afterwards.
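
Since the question asks for the fields to be grouped into Customer and Order, here is a minimal sketch of how the flat DictReader records might be regrouped. The GROUPS mapping and the group_record helper are assumptions made for illustration (they mirror the groups in the question's header parser), and the code targets Python 3, where StringIO lives in the io module. It also strips the stray trailing commas visible in the output above.

```python
from csv import DictReader
from io import StringIO  # Python 3 location of StringIO

# Assumed grouping, mirroring the Customer and Order groups in the question.
GROUPS = {
    "Customer": ["FirstName", "Surname", "Address", "Notes", "PhoneHome", "PhoneMobile"],
    "Order": ["PurchaseOrder", "OrderDate", "Total"],
}


def group_record(rec):
    """Split one flat DictReader record into nested per-group dictionaries."""
    grouped = {}
    for group, fields in GROUPS.items():
        # Keep only the fields this particular file actually has,
        # and drop the stray commas found in the sample rows.
        grouped[group] = {f: rec[f].rstrip(",") for f in fields if f in rec}
    return grouped


data = '''
FirstName Surname Address Notes PurchaseOrder OrderDate
"Bob" "Smith" "123 Lucky Street" "Bad customer" "123ABC", 2013/10/20
'''.strip()

records = [group_record(rec) for rec in DictReader(StringIO(data), delimiter=" ")]
print(records[0]["Customer"])
print(records[0]["Order"])
```

Optional columns need no special handling here: a field that is absent from the header is simply left out of its group.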

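Playing with classes and printing

The question uses PyParsing to get something like class instances. Here is an example of how we can create our own classes.

File classes.py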

class Base():
    templ = """
    - Base:
        - ????
    """
    reprtempl = "<Base: {self.__dict__}>"
    def report(self):
        return self.templ.strip().format(self=self)
    def __repr__(self):
        return self.reprtempl.format(self=self)


class Customer(Base):
    templ = """
    - Customer:
        - Address: {self.address}
        - Firstname: {self.first_name}
        - Surname: {self.surname}
        - Notes: {self.notes}
    """
    reprtempl = "<Customer: {self.__dict__}>"

    def __init__(self, first_name, surname, address, phone_home=None, phone_mobile=None, notes=None, **kwargs):
        self.first_name = first_name
        self.surname = surname
        self.address = address
        self.notes = notes
        self.phone_home = phone_home
        self.phone_mobile = phone_mobile

class Order(Base):
    templ = """
    - Order:
        - Order_date: {self.order_date}
        - Purchase_order: {self.purchase_order}
        - Total: {self.total}
    """
    reprtempl = "<Order: {self.__dict__}>"

    def __init__(self, order_date, purchase_order, total=None, **kwargs):
        self.order_date = order_date
        self.purchase_order = purchase_order
        self.total = total

if __name__ == "__main__":
    customer_dct = {"first_name": "Bob", "surname": "Smith", "address": "Sezam Street 1A",
            "phone_home": "11223344", "phone_mobile": "88990077"}
    customer = Customer(**customer_dct)
    print customer
    print customer.report()
    order_dct = {"order_date": "2014/01/01", "purchase_order": "abc211", "total": "$12"}
    order = Order(**order_dct)
    print order
    print order.report()

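The Base class implements __repr__ and report, and is the common base for the Customer and Order classes that follow.

The constructors use default values (for cases where we expect a given attribute to be missing sometimes) and **kwargs, which makes each constructor tolerant of extra (unexpected) named arguments.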
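
To see that tolerance in isolation, here is a stripped-down stand-in for the Order class (Python 3 syntax, without the Base class and templates). The record below is made up for the demonstration: it has no total key and carries customer fields the order does not care about.

```python
class Order:
    # Simplified stand-in for the Order class in classes.py.
    def __init__(self, order_date, purchase_order, total=None, **kwargs):
        self.order_date = order_date
        self.purchase_order = purchase_order
        self.total = total  # stays None when the record has no "total" key
        # kwargs swallows unrelated keys such as first_name; they are ignored.


rec = {
    "first_name": "Bob",
    "surname": "Smith",
    "order_date": "2013/10/20",
    "purchase_order": "123ABC",
}
order = Order(**rec)
print(order.total)
print(vars(order))
```

Without **kwargs, the call would raise a TypeError for the unexpected first_name and surname arguments, so the same record could not be passed to both Customer and Order.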

The final section, if __name__ ..., contains short test code. If you run

$ python classes.py

you will see the class instances being created and used in practice.

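Using the classes to consume the CSV records

Note: the following code uses slightly modified field names, just to follow the naming conventions used in Python classes. The original field names could be used, but then a key conversion step would have to be added to keep following those conventions (I skipped it).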

from StringIO import StringIO
from csv import DictReader
from classes import Customer, Order

data1 = """
first_name surname address notes purchase_order order_date
"Bob" "Smith" "123 Lucky Street" "Bad customer" "123ABC", 2013/10/20
"Zoe" "Jackson" "5 Mountain View Street" "Good customer" "abc211" 2014/01/01
""".strip()


data2 = """
first_name surname address phone_home phone_mobile purchase_order order_date total
"Bob" "Smith" "123 Lucky Street" "12345678" "1234567890" "123ABC" 2013/10/20, $100
"Zoe" "Jackson" "5 Mountain View Street" "87654321" "0987654321" "abc211" 2014/01/01 $1000
""".strip()

buf1 = StringIO(data1)
buf2 = StringIO(data2)

reader = DictReader(buf1, delimiter=" ")
for rec in reader:
    print rec
    customer = Customer(**rec)
    print customer.report()
    order = Order(**rec)
    print order
    print order.report()

print "---next one comes---"

reader = DictReader(buf2, delimiter=" ")
for rec in reader:
    print rec
    customer = Customer(**rec)
    print customer.report()
    order = Order(**rec)
    print order
    print order.report()
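
The key conversion step skipped in the note above could look roughly like this. It is only a sketch: to_snake_case and convert_keys are hypothetical helper names, and the regular expression assumes the header names are plain CamelCase like the ones in the question. The code is Python 3.

```python
import re


def to_snake_case(name):
    """Convert a CamelCase header name such as 'PhoneHome' to 'phone_home'."""
    # Insert an underscore before every capital letter that is not at the start.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def convert_keys(rec):
    """Return a copy of the record with all keys converted to snake_case."""
    return {to_snake_case(key): value for key, value in rec.items()}


rec = {"FirstName": "Bob", "Surname": "Smith", "PurchaseOrder": "123ABC", "OrderDate": "2013/10/20"}
print(convert_keys(rec))
```

With such a step in place, each record from the original files could be passed on as Customer(**convert_keys(rec)) without renaming the header fields by hand.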

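Conclusions

  • The Python csv module provides DictReader, which delivers each record as a dictionary.
  • Custom Python classes can be written so that they are constructed from a set of keyword arguments, and they can implement convenience methods (such as report).
  • The example could be extended further, for instance to manage the relationship between customers and orders, but that is beyond the scope of this answer.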