read_csv with uncommon delimiters

Time: 2014-09-13 17:55:27

Tags: csv pandas

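I have a CSV file that uses þ as the quote character and the pilcrow (paragraph sign) as the field delimiter in place of the comma.

Subclassing csv.Dialect does not work. pandas does not interpret the þ-quoted values as strings.

Any ideas?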

# This works when the delimiter and quote character are more standard (; and ").
# What I really need are the ASCII characters noted in the comments below.

import csv

class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'  # needed instead: ASCII 020
    quotechar = '"'  # needed instead: ASCII 254

# The with block closes the file even if parsing raises an error.
with open('./data/Test_Quote_SemiColon.dat') as f:
    reader = csv.reader(f, dialect=my_dialect, quoting=csv.QUOTE_ALL)
    for line in reader:
        print(line)

Here is the data (double-quoted, semicolon-delimited):

"BEGID";"ENDID";"Name";"To";"From";"CC";"BCC"
"ABC_001";"ABC_004";"Smith, John";"Doe, John";"Roe, Jane";"";""
"ABC_005";"ABC_007";"Smith, John";"Doe, John";"";"";""
"ABC_008";"ABC_012";"Doe, John";"Doe, John";"Smith, John";"";""
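To make the goal concrete, here is a minimal Python 3 sketch of the same idea using the target characters rather than `;` and `"`. It assumes the codes 020 and 254 in the comments above are decimal, so the delimiter is `chr(20)` and the quote character is `chr(254)` (þ). The names `ThornDialect` and `read_rows` are hypothetical, and the sample text is built in memory with `csv.writer` rather than read from the real file.

```python
import csv
import io

DELIMITER = chr(20)   # assumption: ASCII 020 is decimal 20 (a control character)
QUOTECHAR = chr(254)  # assumption: ASCII 254 is decimal 254, i.e. 'þ'


class ThornDialect(csv.Dialect):
    """Hypothetical dialect: þ-quoted fields separated by chr(20)."""
    delimiter = DELIMITER
    quotechar = QUOTECHAR
    doublequote = True
    skipinitialspace = False
    lineterminator = "\n"
    quoting = csv.QUOTE_ALL


def read_rows(text):
    """Parse already-decoded text with the dialect above."""
    return list(csv.reader(io.StringIO(text, newline=""), dialect=ThornDialect))


# Build a small sample by writing two rows with the same dialect.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, dialect=ThornDialect)
writer.writerow(["BEGID", "ENDID", "Name", "To", "From", "CC", "BCC"])
writer.writerow(["ABC_001", "ABC_004", "Smith, John", "Doe, John", "Roe, Jane", "", ""])
sample = buffer.getvalue()

rows = read_rows(sample)
for row in rows:
    print(row)
```

With a real file, the text would have to be decoded before it reaches `csv.reader`. If the file stores þ as the single byte 254, opening it with `open(path, encoding="latin-1", newline="")` is one plausible choice, but the actual encoding of the file is not stated here and would need to be checked.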

1 answer:

Answer 0: (score: 0)

I found that both the literal character and chr(254) get past the problem. Does this look right?

>>> import csv
>>> import StringIO
>>> txt = '''þBEGIDþþENDIDþþNameþþToþþFromþþCCþþBCCþ þABC_001þþaBC_004þþSmith, JohnþþDoe, JohnþRoe, Janeþþþþþ þABC_005þþaBC_007þþSmith, JohnþþDoe, Johnþþþþþþ þABC_008þþaBC_012þþDoe, JohnþþDoe, JohnþSmith, Johnþþþþþ'''
>>> reader = csv.reader(StringIO.StringIO(txt), delimiter=',', quotechar=chr(254))
>>> for line in reader: 
...     for entry in line:
...         print unicode(entry, 'utf8')
... 
þBEGIDþþENDIDþþNameþþToþþFromþþCCþþBCCþ þABC_001þþaBC_004þþSmith
 JohnþþDoe
 JohnþRoe
 Janeþþþþþ þABC_005þþaBC_007þþSmith
 JohnþþDoe
 Johnþþþþþþ þABC_008þþaBC_012þþDoe
 JohnþþDoe
 JohnþSmith
 Johnþþþþþ

Echoing txt gives the following:

>>> txt
'\xc3\xbeBEGID\xc3\xbe\xc3\xbeENDID\xc3\xbe\xc3\xbeName\xc3\xbe\xc3\xbeTo\xc3\xbe\xc3\xbeFrom\xc3\xbe\xc3\xbeCC\xc3\xbe\xc3\xbeBCC\xc3\xbe \xc3\xbeABC_001\xc3\xbe\xc3\xbeaBC_004\xc3\xbe\xc3\xbeSmith, John\xc3\xbe\xc3\xbeDoe, John\xc3\xbeRoe, Jane\xc3\xbe\xc3\xbe\xc3\xbe\xc3\xbe\xc3\xbe \xc3\xbeABC_005\xc3\xbe\xc3\xbeaBC_007\xc3\xbe\xc3\xbeSmith, John\xc3\xbe\xc3\xbeDoe, John\xc3\xbe\xc3\xbe\xc3\xbe\xc3\xbe\xc3\xbe\xc3\xbe \xc3\xbeABC_008\xc3\xbe\xc3\xbeaBC_012\xc3\xbe\xc3\xbeDoe, John\xc3\xbe\xc3\xbeDoe, John\xc3\xbeSmith, John\xc3\xbe\xc3\xbe\xc3\xbe\xc3\xbe\xc3\xbe'
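One thing the echo suggests: each þ in `txt` is stored as the two UTF-8 bytes `\xc3\xbe`, whereas `chr(254)` in Python 2 is the single byte `\xfe`. That byte never occurs in `txt`, so the quote character appears never to match, which would be consistent with the output above being split at every comma with the þ characters left in place. The following Python 3 sketch illustrates the difference and one way around it, namely decoding the bytes to text first. The helper `parse_utf8` and the one-line sample are hypothetical, and the comma delimiter is used only to mirror the session above.

```python
import csv
import io

THORN = "\u00fe"  # þ, code point 254

utf8_bytes = THORN.encode("utf-8")      # two bytes: 0xC3 0xBE
latin1_bytes = THORN.encode("latin-1")  # one byte: 0xFE


def parse_utf8(raw, delimiter=","):
    """Decode UTF-8 bytes to text, then parse with þ as the quote character."""
    text = raw.decode("utf-8")
    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter=delimiter, quotechar=THORN)
    return list(reader)


# Hypothetical sample: two þ-quoted fields that each contain a comma.
raw = "{0}Smith, John{0},{0}Doe, John{0}\n".format(THORN).encode("utf-8")
rows = parse_utf8(raw)

print(utf8_bytes, latin1_bytes)
print(rows)
```

Once the input is decoded, þ is a single character and the commas inside the quoted fields are kept as part of the field values.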