Python csv module: handling commas inside quotes inside a field

Time: 2014-08-27 16:37:23

Tags: python csv

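I am using Python's csv module to parse data from CSV files in my application. While testing the application, a colleague entered a sample text copied and pasted from a random website.

The sample text has double quotes inside a field, and commas inside those double quotes. Commas outside the double quotes are handled correctly by the csv module, but a comma inside the double quotes splits the text into the next column. I checked the CSV specification, and the field conforms to it: the double quotes are escaped by another set of double quotes.

I checked the file in LibreOffice, and it is handled correctly there.

Here is the row of CSV data that is giving me trouble: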

company_name,company_revenue,company_start_year,company_website,company_description,company_email
Acme Inc,80000000000000,2004,http://google.com,"The company is never clearly defined in Road Runner cartoons but appears to be a conglomerate which produces every product type imaginable, no matter how elaborate or extravagant - most of which never work as desired or expected. In the Road Runner cartoon Beep, Beep, it was referred to as ""Acme Rocket-Powered Products, Inc."" based in Fairfield, New Jersey. Many of its products appear to be produced specifically for Wile E. Coyote; for example, the Acme Giant Rubber Band, subtitled ""(For Tripping Road Runners)"".

Sometimes, Acme can also send living creatures through the mail, though that isn't done very often. Two examples of this are the Acme Wild-Cat, which had been used on Elmer Fudd and Sam Sheepdog (which doesn't maul its intended victim); and Acme Bumblebees in one-fifth bottles (which sting Wile E. Coyote). The Wild Cat was used in the shorts Don't Give Up the Sheep and A Mutt in a Rut, while the bees were used in the short Zoom and Bored.

While their products leave much to be desired, Acme delivery service is second to none; Wile E. can merely drop an order into a mailbox (or enter an order on a website, as in the Looney Tunes: Back in Action movie), and have the product in his hands within seconds.",roadrunner@acme.com

Here is what shows up in the debug log:

2014-08-27 21:35:53,922 - DEBUG: company_website=http://google.com
2014-08-27 21:35:53,923 - DEBUG: company_revenue=80000000000000
2014-08-27 21:35:53,923 - DEBUG: company_start_year=2004
2014-08-27 21:35:53,923 - DEBUG: account_description=The company is never clearly defined in Road Runner cartoons but appears to be a conglomerate which produces every product type imaginable, no matter how elaborate or extravagant - most of which never work as desired or expected. In the Road Runner cartoon Beep, Beep, it was referred to as "Acme Rocket-Powered Products
2014-08-27 21:35:53,924 - DEBUG: company_name=Acme Inc
2014-08-27 21:35:53,925 - DEBUG: company_email=Inc."" based in Fairfield

The relevant code that handles the CSV parsing:

with open(csvfile, 'rU') as contactsfile:
    # sniff for dialect of csvfile so we can automatically determine
    # what delimiters to use
    try:
        dialect = csv.Sniffer().sniff(contactsfile.read(2048))
    except:
        dialect = 'excel'
    get_total_jobs(contactsfile, dialect)
    contacts = csv.DictReader(contactsfile, dialect=dialect, skipinitialspace=True, quoting=csv.QUOTE_MINIMAL)
    # Start reading the rows
    for row in contacts:
        process_job()
        for key, value in row.iteritems():
            logging.debug("{}={}".format(key,value))

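I know this is just junk data and we will probably never see anything like it, but the CSV files we receive are outside our control, so edge cases like this can occur. Since it is a valid CSV file that LibreOffice handles correctly, I should be able to handle it correctly too.

I have searched other questions about CSV handling where people had trouble with quotes or commas inside a field. Both of those cases work for me; my problem is a comma nested inside quotes inside a field. There is a question with the same problem that offers a solution, Comma in DoubleDouble Quotes in CSV File, but it is a hackish approach that does not preserve the content as given, and that content is valid according to RFC 4180.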
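To illustrate the escaping convention referred to above, here is a minimal Python 3 sketch (the question itself uses Python 2) showing that the standard library's own writer produces the same doubled-quote form. The shortened `description` string and the variable names are made up for the illustration.

```python
import csv
import io

# Shortened stand-in for the problematic field: it contains both
# double quotes and commas.
description = ('referred to as "Acme Rocket-Powered Products, Inc." '
               'based in Fairfield, New Jersey')

buffer = io.StringIO()
csv.writer(buffer).writerow(["Acme Inc", description, "roadrunner@acme.com"])
encoded = buffer.getvalue()
print(encoded)  # the embedded quotes come out doubled, the field is wrapped in quotes

# With the default dialect (doublequote=True) the line reads back unchanged.
decoded = next(csv.reader(io.StringIO(encoded)))
print(decoded[1] == description)
```

So the data in the question is what a conforming writer would emit; the open question is only why the reader is not undoing the doubling.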

1 Answer:

Answer 0: (score: 2)

Dialect.doublequote attribute

  

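Controls how instances of quotechar appearing inside a field should themselves be quoted. When True, the character is doubled. When False, the escapechar is used as a prefix to the quotechar. It defaults to True.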
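A small Python 3 sketch of what that attribute changes when reading. The `sample` line is a shortened, made-up version of the row from the question, and `parse` is a hypothetical helper, not part of the csv module.

```python
import csv
import io

# One record with three intended fields; the middle one is quoted and
# contains doubled quotes as well as commas.
sample = ('Acme Inc,"referred to as ""Acme Rocket-Powered Products, Inc."" '
          'based in Fairfield, New Jersey",roadrunner@acme.com\n')


def parse(text, doublequote):
    """Parse a single CSV line with the given doublequote setting."""
    return next(csv.reader(io.StringIO(text), doublequote=doublequote))


with_doubling = parse(sample, True)
without_doubling = parse(sample, False)

print(len(with_doubling))     # 3 fields, quotes inside the field restored
print(len(without_doubling))  # more than 3: the first inner quote ends the quoted part
```

With `doublequote=False` the reader treats the first `"` of a `""` pair as the closing quote, so the commas that follow act as delimiters again, which is the kind of split visible in the debug log.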

The sniffer sets the doublequote attribute to False, but the CSV you posted should be parsed with doublequote=True:

import csv
with open(csvfile, 'rb') as contactsfile:
    # sniff for dialect of csvfile so we can automatically determine
    # what delimiters to use
    try:
        dialect = csv.Sniffer().sniff(contactsfile.read(2048))
    except csv.Error:
        dialect = 'excel'
    # get_total_jobs(contactsfile, dialect)
    contactsfile.seek(0)
    contacts = csv.DictReader(contactsfile, dialect=dialect, skipinitialspace=True,
                              quoting=csv.QUOTE_MINIMAL, doublequote=True)
    # Start reading the rows
    for row in contacts:
        for key, value in row.iteritems():
            print("{}={}".format(key,value))

yields

company_description=The company is never clearly defined in Road Runner cartoons but appears to be a conglomerate which produces every product type imaginable, no matter how elaborate or extravagant - most of which never work as desired or expected. In the Road Runner cartoon Beep, Beep, it was referred to as "Acme Rocket-Powered Products, Inc." based in Fairfield, New Jersey. Many of its products appear to be produced specifically for Wile E. Coyote; for example, the Acme Giant Rubber Band, subtitled "(For Tripping Road Runners)".

Sometimes, Acme can also send living creatures through the mail, though that isn't done very often. Two examples of this are the Acme Wild-Cat, which had been used on Elmer Fudd and Sam Sheepdog (which doesn't maul its intended victim); and Acme Bumblebees in one-fifth bottles (which sting Wile E. Coyote). The Wild Cat was used in the shorts Don't Give Up the Sheep and A Mutt in a Rut, while the bees were used in the short Zoom and Bored.

While their products leave much to be desired, Acme delivery service is second to none; Wile E. can merely drop an order into a mailbox (or enter an order on a website, as in the Looney Tunes: Back in Action movie), and have the product in his hands within seconds.
company_website=http://google.com
company_start_year=2004
company_name=Acme Inc
company_revenue=80000000000000
company_email=roadrunner@acme.com
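
The code above is Python 2. For comparison, a rough Python 3 sketch of the same idea follows: sniff the dialect, then override `doublequote` explicitly. The function name `read_contacts`, its `sample_size` parameter, and the shortened sample row are assumptions made for this illustration, not part of the original code.

```python
import csv
import os
import tempfile


def read_contacts(path, sample_size=2048):
    """Sniff the dialect, but always treat doubled quotes as escaped quotes."""
    with open(path, newline='', encoding='utf-8') as contactsfile:
        try:
            dialect = csv.Sniffer().sniff(contactsfile.read(sample_size))
        except csv.Error:
            dialect = 'excel'
        contactsfile.seek(0)  # rewind after sniffing
        # Keyword arguments take precedence over the sniffed dialect's attributes.
        reader = csv.DictReader(contactsfile, dialect=dialect, doublequote=True)
        return list(reader)


# Shortened, made-up stand-in for the file in the question.
sample = (
    'company_name,company_description,company_email\n'
    'Acme Inc,"it was referred to as ""Acme Rocket-Powered Products, Inc."" '
    'based in Fairfield, New Jersey",roadrunner@acme.com\n'
)

with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 newline='', encoding='utf-8') as tmp:
    tmp.write(sample)

try:
    rows = read_contacts(tmp.name)
finally:
    os.remove(tmp.name)

for key, value in rows[0].items():
    print("{}={}".format(key, value))
```

Passing `doublequote=True` alongside `dialect=` keeps whatever delimiter the sniffer detected while no longer depending on its guess about quote escaping.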

Also, per the docs, in Python 2 the file handle should be opened in 'rb' mode, not 'rU' mode:

  

If csvfile is a file object, it must be opened with the 'b' flag on platforms where that makes a difference.
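
That advice is specific to Python 2. In Python 3 the csv documentation instead recommends opening the file in text mode with `newline=''`, so that line breaks embedded in quoted fields (as in the multi-paragraph description above) are left to the csv module. A minimal sketch, using a made-up two-column file:

```python
import csv
import os
import tempfile

# A field with embedded line breaks and a comma, similar in shape to the
# multi-paragraph description in the question.
description = "First paragraph.\n\nSecond paragraph, with a comma."

fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
try:
    # newline='' on both write and read leaves newline handling to csv.
    with open(path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["company_name", "company_description"])
        writer.writerow(["Acme Inc", description])

    with open(path, newline="", encoding="utf-8") as src:
        records = list(csv.DictReader(src))
finally:
    os.remove(path)

print(len(records))  # one record, despite the line breaks inside the field
print(records[0]["company_description"] == description)
```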