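I uploaded a csv/tsv file from a form in GAE, and I am trying to parse it with the Python csv module.
As described here, files uploaded in GAE are strings, so I treat the uploaded string as a file-like object: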
file = self.request.get('catalog')
catalog = csv.reader(StringIO.StringIO(file),dialect=csv.excel_tab)
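But the line endings in my file are not necessarily '\n' (thanks, Excel...), and this produces an error:
Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Does anyone know how to make StringIO.StringIO treat a string like a file opened in universal-newline mode?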
Answer 0 (score: 5)
How about:
file = self.request.get('catalog')
file = '\n'.join(file.splitlines())
catalog = csv.reader(StringIO.StringIO(file),dialect=csv.excel_tab)
Or, as pointed out in the comments, csv.reader() accepts a list of lines as input, so:
file = self.request.get('catalog')
catalog = csv.reader(file.splitlines(),dialect=csv.excel_tab)
Or, if request.get ever supports a read mode in the future (hypothetical):
file = self.request.get('catalog', 'rU')
catalog = csv.reader(StringIO.StringIO(file),dialect=csv.excel_tab)
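The first two variants above can be tried outside GAE. The following is a minimal sketch, not the answerer's code: it assumes Python 3 (so io.StringIO stands in for StringIO.StringIO), and the `uploaded` string is made-up sample data in place of self.request.get('catalog').

```python
import csv
import io

# Hypothetical upload that mixes '\r', '\r\n' and '\n' line endings.
uploaded = "id\tname\r1\tapple\r\n2\tpear\n"

# Variant 1: rewrite every line ending as '\n', then wrap the result
# in a file-like object.
normalized = "\n".join(uploaded.splitlines())
rows_from_stringio = list(
    csv.reader(io.StringIO(normalized), dialect=csv.excel_tab)
)

# Variant 2: csv.reader accepts any iterable of strings, so the list
# returned by splitlines() can be passed in directly.
rows_from_list = list(
    csv.reader(uploaded.splitlines(), dialect=csv.excel_tab)
)

print(rows_from_stringio)
print(rows_from_list)
```

Both variants hold the whole upload in memory, which is fine for small form uploads. Note that in Python 3, str.splitlines() also splits on a few extra separators such as '\x0b' and '\x0c', so data containing those characters would need a stricter split.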
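Answer 1 (score: 4)
The solution described here should work. By defining an iterator class as below, which loads the blob 1MB at a time, splits each block into lines with .splitlines(), and feeds the lines to the CSV reader one at a time, the newlines can be handled without loading the whole file into memory.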
class BlobIterator:
    """Because the python csv module doesn't like strange newline chars and
    the google blob reader cannot be told to open in universal mode, then
    we need to read blocks of the blob and 'fix' the newlines as we go"""

    def __init__(self, blob_reader):
        self.blob_reader = blob_reader
        self.last_line = ""
        self.line_num = 0
        self.lines = []
        self.buffer = None

    def __iter__(self):
        return self

    def next(self):
        if not self.buffer or len(self.lines) == self.line_num + 1:
            self.buffer = self.blob_reader.read(1048576)  # 1MB buffer
            self.lines = self.buffer.splitlines()
            self.line_num = 0

            # Handle special case where our block just happens to end on a new line
            if self.buffer[-1:] == "\n" or self.buffer[-1:] == "\r":
                self.lines.append("")

        if not self.buffer:
            raise StopIteration

        if self.line_num == 0 and len(self.last_line) > 0:
            result = self.last_line + self.lines[self.line_num] + "\n"
        else:
            result = self.lines[self.line_num] + "\n"

        self.last_line = self.lines[self.line_num + 1]
        self.line_num += 1

        return result
Then call it like this:
blob_reader = blobstore.BlobReader(blob_key)
blob_iterator = BlobIterator(blob_reader)
reader = csv.reader(blob_iterator)
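The same chunked idea can also be written as a generator. The sketch below is an illustration under stated assumptions, not part of the original answer: it targets Python 3, the name `iter_lines` is hypothetical, and `reader` is any object whose read(n) returns text (an in-memory io.StringIO stands in for blobstore.BlobReader here). It additionally keeps a final line that has no trailing newline and copes with a '\r\n' pair split across two reads, cases the class above may not cover.

```python
import csv
import io
import re

NEWLINE = re.compile(r"\r\n|\r|\n")


def iter_lines(reader, chunk_size=1048576):
    """Yield '\n'-terminated lines from reader, reading chunk_size at a time."""
    pending = ""  # partial last line carried over from the previous chunk
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break
        data = pending + chunk
        # A trailing '\r' may be the first half of a '\r\n' pair that was
        # split across two chunks, so hold it back until more data arrives.
        held_cr = data.endswith("\r")
        if held_cr:
            data = data[:-1]
        parts = NEWLINE.split(data)
        # The last piece has no terminator yet; keep it for the next round.
        pending = parts.pop() + ("\r" if held_cr else "")
        for part in parts:
            yield part + "\n"
    if pending:
        # Flush whatever is left once the input is exhausted.
        yield pending.rstrip("\r") + "\n"


# Made-up sample with mixed line endings and no final newline; the tiny
# chunk size forces line endings to straddle chunk boundaries.
sample = "a\tb\r\nc\td\re\tf\ng\th"
source = io.StringIO(sample, newline="")
rows = list(csv.reader(iter_lines(source, chunk_size=3), dialect=csv.excel_tab))
print(rows)
```

As with the class above, only one chunk plus one partial line is held in memory at a time. A quoted field that contains an embedded line break would still have that break rewritten as '\n'.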