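I want to export data from a CSV file that contains Unicode strings.
I previously tried the Python script below. It only works with ASCII data and does not support Unicode: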
#! /usr/bin/env python
import csv

# Register a dialect that matches the input file's quoting rules.
csv.register_dialect('custom', delimiter=',',
                     doublequote=True,
                     escapechar=None,
                     quotechar='"',
                     quoting=csv.QUOTE_MINIMAL, skipinitialspace=False)

with open('input.csv') as ifile:
    data = csv.reader(ifile, dialect='custom')
    for record in data:
        for i, field in enumerate(record):
            print (" <field%s>" % i + field + "</field%s>" % i)
It fails with:

Traceback (most recent call last):
    for record in data:
_csv.Error: line contains NULL byte
Answer 0 (score: 2)
Use the unicodecsv library instead:
https://github.com/jdunck/python-unicodecsv
import unicodecsv as csv

# unicodecsv decodes the bytes itself, so open the file in binary mode.
with open('input.csv', 'rb') as ifile:
    rows = [row for row in csv.reader(ifile, encoding='utf-8')]
print(rows)
Answer 1 (score: 1)
You can wrap csv.reader in a class that handles the decoding for you. The following is taken from the Python 2 csv documentation examples and works for me:
#! /usr/bin/env python
import csv, codecs

class UTF8Recoder:
    """
    Iterator that reads an encoded stream and reencodes the input to UTF-8
    """
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]

    def __iter__(self):
        return self

csv.register_dialect('custom', delimiter=',',
                     doublequote=True,
                     escapechar=None,
                     quotechar='"',
                     quoting=csv.QUOTE_MINIMAL, skipinitialspace=False)

with open('input.csv') as ifile:
    data = UnicodeReader(ifile, dialect='custom')
    for record in data:
        for i, field in enumerate(record):
            print (" <field%s>" % i + field + "</field%s>" % i)
The same documentation examples also include a UnicodeWriter class, if you need to write as well.
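For comparison, on Python 3 the csv module works on text directly, so a UnicodeWriter-style wrapper is usually not needed there. The following is only a minimal sketch under that assumption; `write_rows` is a hypothetical helper name, not part of the csv module, and the UTF-8 default is an arbitrary choice.

```python
import csv


def write_rows(path, rows, encoding='utf-8'):
    """Write an iterable of rows (lists of str) to a CSV file.

    Hypothetical helper for illustration; Python 3 only.
    """
    # newline='' lets the csv module control line endings itself.
    with open(path, 'w', newline='', encoding=encoding) as csvfile:
        writer = csv.writer(csvfile, quoting=csv.QUOTE_MINIMAL)
        writer.writerows(rows)
```

Calling `write_rows('output.csv', [['naïve', 'a,b']])` would write one row, quoting only the field that contains the delimiter.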
Answer 2 (score: 0)
It looks like you are using Python 3. Follow the very first code example in the docs:
#!/usr/bin/env python3
import csv

with open('input.csv', newline='', encoding=encoding) as csvfile:
    reader = csv.reader(csvfile, dialect="custom")
    for row in reader:
        print(", ".join(row))
Here the "custom" dialect is the one defined in the code in your question, and encoding is your file's character encoding, e.g. "utf-16". If you omit the encoding argument, the encoding returned by locale.getpreferredencoding(False) is used.
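Putting these pieces together, a self-contained Python 3 sketch might look like the following. The helper name `read_fields` and the "utf-16" default are assumptions for illustration only. NULL bytes in the input are often a sign of a UTF-16 file being read as a single-byte encoding, but check what your file actually uses.

```python
import csv

# Same dialect as in the question.
csv.register_dialect('custom', delimiter=',',
                     doublequote=True,
                     escapechar=None,
                     quotechar='"',
                     quoting=csv.QUOTE_MINIMAL, skipinitialspace=False)


def read_fields(path, encoding='utf-16'):
    """Yield one tagged string per CSV field (hypothetical helper)."""
    # newline='' lets the csv module handle line endings itself;
    # the encoding default is an assumption, not a detected value.
    with open(path, newline='', encoding=encoding) as csvfile:
        for record in csv.reader(csvfile, dialect='custom'):
            for i, field in enumerate(record):
                yield " <field%s>" % i + field + "</field%s>" % i
```

To get the same output as the script in the question, loop over `read_fields('input.csv')` and print each string.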