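I have been slowly learning PyParsing and find it a very promising tool, but I am struggling because detailed documentation is scarce. As a result, I have run into a problem.
My goal is to parse CSV files containing sets of columns that form groups of data. These groups matter later, when the data is interpreted in post-processing. In addition, the CSV files have optional columns, which is why I really like pyparsing: it is flexible.
I have successfully created a parser that validates and correctly parses the header of the CSV file. However, I see two options for handling the data rows correctly.
1) I can create another parser for the data based on the ParseResults of the header, so that the data parser knows which columns to expect.
OR
2) Read the data rows as arrays; but then how do I retrieve the column number (not the character offset) of each header field from the header parser?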
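For reference, the column-number lookup in option 2 can be sketched without pyparsing. This is only a minimal sketch, assuming the header is whitespace-separated and every field name occurs once; the helper name `column_index` is made up for illustration, and the syntax is Python 3.

```python
# Hypothetical helper (not part of pyparsing): map each header field
# name to its zero-based column number.
def column_index(header_line):
    """Return {field name: column number} for a whitespace-separated header."""
    return {name: idx for idx, name in enumerate(header_line.split())}


header = "FirstName Surname Address Notes PurchaseOrder OrderDate"
columns = column_index(header)

# These are column numbers, not character offsets.
print(columns["Address"])
print(columns["PurchaseOrder"])
```

A data row that has already been split into a list can then be indexed with `row[columns["Address"]]`, and optional columns are simply absent from the mapping.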
Below is a toy example illustrating what I want to achieve.
csv_header_1='''FirstName Surname Address Notes PurchaseOrder OrderDate'''
csv_data_1='''"Bob" "Smith" "123 Lucky Street" "Bad customer" "123ABC", 2013/10/20
"Zoe" "Jackson" "5 Mountain View Street" "Good customer" "abc211" 2014/01/01'''.splitlines()
csv_header_2='''FirstName Surname Address PhoneHome PhoneMobile PurchaseOrder OrderDate Total'''
csv_data_2='''"Bob" "Smith" "123 Lucky Street" "12345678" "1234567890" "123ABC" 2013/10/20, $100
"Zoe" "Jackson" "5 Mountain View Street" "87654321" "0987654321" "abc211" 2014/01/01 $1000'''.splitlines()
# Pyparsing header parser:
from pyparsing import Group, Literal, Optional, ParseException

print 'Create pyparsing Elements.'
firstname = Literal('FirstName').setResultsName('Firstname')
surname = Literal('Surname').setResultsName('Surname')
address = Literal('Address').setParseAction(lambda tokens: " ".join(tokens)).setResultsName('Address')
notes = Literal('Notes').setResultsName('Notes')
phone_home = Literal('PhoneHome').setResultsName('Home')
phone_mobile = Literal('PhoneMobile').setResultsName('Mobile')
customer = Group(firstname + surname + address + Optional(notes) + Optional(phone_home + phone_mobile)).setResultsName('Customer')
purchase_order = Literal('PurchaseOrder').setResultsName('Purchase_order')
order_date = Literal('OrderDate').setResultsName('Order_date')
total = Literal('Total').setResultsName('Total')
order = Group(purchase_order + order_date + Optional(total)).setResultsName('Order')
header = Group(customer + order).setResultsName('Header')

print 'Parse CSV header 1.'
try:
    parsed_header = header.parseString(csv_header_1)
except ParseException as err:
    # Show the offending line with a caret under the failing column.
    print err.line
    print " " * (err.column - 1) + "^"
    print err
else:
    # Dump only when parsing succeeded; parsed_header is unset otherwise.
    print 'CSV header 1 dump: ', parsed_header.dump()

try:
    parsed_header = header.parseString(csv_header_2)
except ParseException as err:
    print err.line
    print " " * (err.column - 1) + "^"
    print err
else:
    print 'CSV header 2 dump: ', parsed_header.dump()
Output:
Create pyparsing Elements.
Parse CSV header 1.
CSV header 1 dump:  [[['FirstName', 'Surname', 'Address', 'Notes'], ['PurchaseOrder', 'OrderDate']]]
- Header: [['FirstName', 'Surname', 'Address', 'Notes'], ['PurchaseOrder', 'OrderDate']]
  - Customer: ['FirstName', 'Surname', 'Address', 'Notes']
    - Address: Address
    - Firstname: FirstName
    - Notes: Notes
    - Surname: Surname
  - Order: ['PurchaseOrder', 'OrderDate']
    - Order_date: OrderDate
    - Purchase_order: PurchaseOrder
CSV header 2 dump:  [[['FirstName', 'Surname', 'Address', 'PhoneHome', 'PhoneMobile'], ['PurchaseOrder', 'OrderDate', 'Total']]]
- Header: [['FirstName', 'Surname', 'Address', 'PhoneHome', 'PhoneMobile'], ['PurchaseOrder', 'OrderDate', 'Total']]
  - Customer: ['FirstName', 'Surname', 'Address', 'PhoneHome', 'PhoneMobile']
    - Address: Address
    - Firstname: FirstName
    - Home: PhoneHome
    - Mobile: PhoneMobile
    - Surname: Surname
  - Order: ['PurchaseOrder', 'OrderDate', 'Total']
    - Order_date: OrderDate
    - Purchase_order: PurchaseOrder
    - Total: Total
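The header parser works well, but how do I parse the data rows correctly?
I know I could write a data parser based on the data type of each field, but that does not work because the optional columns do not necessarily have unique data types. I need to use the header to determine how many columns each group has and what type of data each column holds.
I can create the parser rules manually, as below, but I need to create the 'customer' and 'order' ParserElements dynamically so that the row data is parsed correctly. (Note that the snippet below does not handle the double quotes.)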
from pyparsing import Combine, Group, OneOrMore, Suppress, Word, alphas, nums

firstname = Word(alphas).setResultsName('Firstname')
surname = Word(alphas).setResultsName('Surname')
address = OneOrMore(Word(alphas)).setParseAction(lambda tokens: " ".join(tokens)).setResultsName('Address')
phone_home = Word(nums).setResultsName('Home')
phone_mobile = Word(nums).setResultsName('Mobile')
# customer=Group(firstname + surname + address + Optional(phone_home) + Optional(phone_mobile) ).setResultsName('Customer')
purchase_order = Word(alphas).setResultsName('Purchase_order')
# nums is only the string '0123456789'; wrap it in Word() to match digit runs.
order_date = Combine(Word(nums) + "/" + Word(nums) + "/" + Word(nums)).setResultsName('Date')
total = Group(Suppress('$') + Word(nums)).setResultsName('Total')
# order = Group(purchase_order + order_date + Optional(total) ).setResultsName('Order')
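One way to let the header decide which columns, groups and types to expect is sketched below. It deliberately uses only the standard library rather than pyparsing, so it is not the dynamic ParserElement construction asked about, only an illustration of the header-driven idea in Python 3. The names `GROUPS`, `CONVERTERS` and `build_row_parser` are hypothetical, and the group membership is assumed from the Customer/Order grouping of the header parser.

```python
import shlex
from datetime import datetime

# Assumed grouping of header names, mirroring the Customer/Order groups.
GROUPS = {
    "Customer": ["FirstName", "Surname", "Address", "Notes",
                 "PhoneHome", "PhoneMobile"],
    "Order": ["PurchaseOrder", "OrderDate", "Total"],
}

# Assumed per-column converters; columns not listed stay plain strings.
CONVERTERS = {
    "OrderDate": lambda text: datetime.strptime(text, "%Y/%m/%d").date(),
    "Total": lambda text: int(text.lstrip("$")),
}


def build_row_parser(header_line):
    """Return a function that parses one data row according to the header."""
    names = header_line.split()

    def parse_row(line):
        # shlex honours the double quotes, so "123 Lucky Street" is one field.
        # The sample data carries stray commas after some fields; drop them.
        values = [value.rstrip(",") for value in shlex.split(line)]
        if len(values) != len(names):
            raise ValueError("expected %d columns, got %d"
                             % (len(names), len(values)))
        record = {name: CONVERTERS.get(name, str)(value)
                  for name, value in zip(names, values)}
        # Keep only the columns that this particular header actually has.
        return {group: {name: record[name] for name in fields if name in record}
                for group, fields in GROUPS.items()}

    return parse_row


parse_row1 = build_row_parser(
    "FirstName Surname Address Notes PurchaseOrder OrderDate")
row1 = parse_row1(
    '"Bob" "Smith" "123 Lucky Street" "Bad customer" "123ABC", 2013/10/20')

parse_row2 = build_row_parser(
    "FirstName Surname Address PhoneHome PhoneMobile PurchaseOrder OrderDate Total")
row2 = parse_row2(
    '"Bob" "Smith" "123 Lucky Street" "12345678" "1234567890" "123ABC" 2013/10/20, $100')

print(row1["Customer"])
print(row2["Order"])
```

Because the row parser is built from the header, optional columns need no special handling: a group simply contains fewer keys when the header omits them.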
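Thanks in advance; any help would be much appreciated.
Update
Below is sample output that I would like to get from the pyparsing parser for the data rows. The examples cover only a single data row from each of the CSV samples given above.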
CSV data 1 dump: [[["Bob" "Smith" "123 Lucky Street" "Bad customer"], ["123ABC", 2013/10/20]]]
- Header: [["Bob" "Smith" "123 Lucky Street" "Bad customer"], ["123ABC", 2013/10/20]]
  - Customer: ["Bob" "Smith" "123 Lucky Street" "Bad customer"]
    - Address: "123 Lucky Street"
    - Firstname: "Bob"
    - Notes: "Bad customer"
    - Surname: "Smith"
  - Order: ["123ABC", 2013/10/20]
    - Order_date: 2013/10/20
    - Purchase_order: "123ABC"
CSV data 2 dump: [[["Bob" "Smith" "123 Lucky Street" "12345678" "1234567890"], [ "123ABC" 2013/10/20, $100]]]
- Header: [["Bob" "Smith" "123 Lucky Street" "12345678" "1234567890"], [ "123ABC" 2013/10/20, $100]]
  - Customer: ["Bob" "Smith" "123 Lucky Street" "12345678" "1234567890"]
    - Address: "123 Lucky Street"
    - Firstname: "Bob"
    - Home: "12345678"
    - Mobile: "1234567890"
    - Surname: "Smith"
  - Order: [ "123ABC" 2013/10/20, $100]
    - Order_date: 2013/10/20
    - Purchase_order: "123ABC"
    - Total: $100
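This is only an example; I am open to different approaches, such as those suggested by Jan and EOL.
Answer 0 (score: 2)
Check the documentation of csv, the built-in module. There you will find DictReader, which lets you process CSV files that have a header and gives you an iterator over the records/lines; each record is returned as a dictionary with one key per field name and the related value.
Put this data into a file named "data.csv":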
name;surname
Jan;Vlcinsky
Pieter;Pan
Jane;Fonda
Then you can test it from the console:
>>> from csv import DictReader
>>> fname = "data.csv"
>>> f = open(fname)
>>> reader = DictReader(f, delimiter=";")
>>> for rec in reader:
...     print rec
...
{'surname': 'Vlcinsky', 'name': 'Jan'}
{'surname': 'Pan', 'name': 'Pieter'}
{'surname': 'Fonda', 'name': 'Jane'}
Using your data, with StringIO simulating an open file:
from StringIO import StringIO
from csv import DictReader
data1 = """
FirstName Surname Address Notes PurchaseOrder OrderDate
"Bob" "Smith" "123 Lucky Street" "Bad customer" "123ABC", 2013/10/20
"Zoe" "Jackson" "5 Mountain View Street" "Good customer" "abc211" 2014/01/01
""".strip()
data2 = """
FirstName Surname Address PhoneHome PhoneMobile PurchaseOrder OrderDate Total
"Bob" "Smith" "123 Lucky Street" "12345678" "1234567890" "123ABC" 2013/10/20, $100
"Zoe" "Jackson" "5 Mountain View Street" "87654321" "0987654321" "abc211" 2014/01/01 $1000
""".strip()
buf1 = StringIO(data1)
buf2 = StringIO(data2)
reader = DictReader(buf1, delimiter=" ")
for rec in reader:
    print rec
print "---next one comes---"
reader = DictReader(buf2, delimiter=" ")
for rec in reader:
    print rec
which prints:
{'Surname': 'Smith', 'FirstName': 'Bob', 'Notes': 'Bad customer', 'PurchaseOrder': '123ABC,', 'Address': '123 Lucky Street', 'OrderDate': '2013/10/20'}
{'Surname': 'Jackson', 'FirstName': 'Zoe', 'Notes': 'Good customer', 'PurchaseOrder': 'abc211', 'Address': '5 Mountain View Street', 'OrderDate': '2014/01/01'}
---next one comes---
{'Surname': 'Smith', 'FirstName': 'Bob', 'PhoneMobile': '1234567890', 'PhoneHome': '12345678', 'PurchaseOrder': '123ABC', 'Address': '123 Lucky Street', 'Total': '$100', 'OrderDate': '2013/10/20,'}
{'Surname': 'Jackson', 'FirstName': 'Zoe', 'PhoneMobile': '0987654321', 'PhoneHome': '87654321', 'PurchaseOrder': 'abc211', 'Address': '5 Mountain View Street', 'Total': '$1000', 'OrderDate': '2014/01/01'}
That completes the parsing part; what remains is to create the appropriate objects from it later.
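Note that the records printed above still carry the stray commas from the sample data ('123ABC,' and '2013/10/20,'). A minimal sketch of one way to clean them up, assuming a trailing comma is never a meaningful part of a value; the helper name `clean` is made up, the data is shortened, and the syntax is Python 3 (hence `io.StringIO`).

```python
from csv import DictReader
from io import StringIO

# Shortened sample with the same stray comma as in the question's data.
data = '''FirstName Surname PurchaseOrder OrderDate
"Bob" "Smith" "123ABC", 2013/10/20
"Zoe" "Jackson" "abc211" 2014/01/01'''


def clean(rec):
    """Return a copy of the record without trailing commas in the values."""
    return {key: value.rstrip(",") for key, value in rec.items()}


records = [clean(rec) for rec in DictReader(StringIO(data), delimiter=" ")]
for rec in records:
    print(rec["PurchaseOrder"], rec["OrderDate"])
```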
The question uses the PyParsing results as a kind of class instance. Here is an example of how we can create our own classes.
File classes.py:
class Base():
    templ = """
    - Base:
        - ????
    """
    reprtempl = "<Base: {self.__dict__}>"

    def report(self):
        return self.templ.strip().format(self=self)

    def __repr__(self):
        return self.reprtempl.format(self=self)


class Customer(Base):
    templ = """
    - Customer:
        - Address: {self.address}
        - Firstname: {self.first_name}
        - Surname: {self.surname}
        - Notes: {self.notes}
    """
    reprtempl = "<Customer: {self.__dict__}>"

    def __init__(self, first_name, surname, address, phone_home=None, phone_mobile=None, notes=None, **kwargs):
        self.first_name = first_name
        self.surname = surname
        self.address = address
        self.notes = notes
        self.phone_home = phone_home
        self.phone_mobile = phone_mobile


class Order(Base):
    templ = """
    - Order:
        - Order_date: {self.order_date}
        - Purchase_order: {self.purchase_order}
        - Total: {self.total}
    """
    reprtempl = "<Order: {self.__dict__}>"

    def __init__(self, order_date, purchase_order, total=None, **kwargs):
        self.order_date = order_date
        self.purchase_order = purchase_order
        self.total = total


if __name__ == "__main__":
    customer_dct = {"first_name": "Bob", "surname": "Smith", "address": "Sezam Street 1A",
                    "phone_home": "11223344", "phone_mobile": "88990077"}
    customer = Customer(**customer_dct)
    print customer
    print customer.report()
    order_dct = {"order_date": "2014/01/01", "purchase_order": "abc211", "total": "$12"}
    order = Order(**order_dct)
    print order
    print order.report()
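The Base class implements __repr__ and report, and is the common base for the following classes, Customer and Order.
The constructors use default values (for cases where we expect a given attribute to be missing sometimes) and **kwargs, which makes the constructor tolerant of extra (unexpected) named arguments.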
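The effect of the default values and **kwargs can be seen in isolation in the following minimal sketch (Python 3 syntax). It repeats only the Order constructor from classes.py; the record is a hand-written dictionary of the kind DictReader would deliver.

```python
class Order(object):
    # Same constructor signature as Order in classes.py.
    def __init__(self, order_date, purchase_order, total=None, **kwargs):
        self.order_date = order_date
        self.purchase_order = purchase_order
        self.total = total  # stays None when the record has no such key


# Customer and order fields mixed in one record, and no "total" key.
rec = {"first_name": "Bob", "surname": "Smith", "address": "123 Lucky Street",
       "purchase_order": "123ABC", "order_date": "2013/10/20"}

# first_name, surname and address are absorbed by **kwargs and ignored.
order = Order(**rec)
print(order.purchase_order, order.total)
```

Without **kwargs the call would raise a TypeError for the unexpected keyword arguments, so the same record could not be passed to both Customer and Order.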
The last section, if __name__ ..., contains short test code. If you run
$ python classes.py
you will see the class instances being created and used in practice.
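Note: the following code uses slightly modified field names, just to follow the naming conventions of Python classes. The original field names could be used, but then a key-conversion step would have to be added to keep to those conventions inside the classes (I skipped it).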
from StringIO import StringIO
from csv import DictReader
from classes import Customer, Order
data1 = """
first_name surname address notes purchase_order order_date
"Bob" "Smith" "123 Lucky Street" "Bad customer" "123ABC", 2013/10/20
"Zoe" "Jackson" "5 Mountain View Street" "Good customer" "abc211" 2014/01/01
""".strip()
data2 = """
first_name surname address phone_home phone_mobile purchase_order order_date total
"Bob" "Smith" "123 Lucky Street" "12345678" "1234567890" "123ABC" 2013/10/20, $100
"Zoe" "Jackson" "5 Mountain View Street" "87654321" "0987654321" "abc211" 2014/01/01 $1000
""".strip()
buf1 = StringIO(data1)
buf2 = StringIO(data2)
reader = DictReader(buf1, delimiter=" ")
for rec in reader:
    print rec
    customer = Customer(**rec)
    print customer.report()
    order = Order(**rec)
    print order
    print order.report()
print "---next one comes---"
reader = DictReader(buf2, delimiter=" ")
for rec in reader:
    print rec
    customer = Customer(**rec)
    print customer.report()
    order = Order(**rec)
    print order
    print order.report()
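The key-conversion step that was skipped could look roughly like the sketch below, which would let the original CamelCase header names be kept in the file. It is only one possible approach, written for Python 3; the helper names `to_snake_case` and `convert_keys` are made up for illustration.

```python
import re


def to_snake_case(name):
    """Convert a CamelCase header name such as 'PhoneHome' to 'phone_home'."""
    # Insert "_" before every capital letter except the very first character.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def convert_keys(rec):
    """Return a copy of a DictReader record with snake_case keys."""
    return {to_snake_case(key): value for key, value in rec.items()}


rec = {"FirstName": "Bob", "PhoneHome": "12345678", "Total": "$100"}
converted = convert_keys(rec)
print(converted)
```

The converted record can then be passed on as `Customer(**converted)` or `Order(**converted)`.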
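To sum up: the csv module allows reading the data with DictReader, which provides each record as a dictionary; these dictionaries can be passed to the class constructors, and the resulting instances can print themselves (report).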