Question

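I need to read a huge (larger than memory) unquoted TSV file. Fields may contain the string "\n". However, Python tries to be clever and splits that string in two. For example, a line containing: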

cat    dog    fish\nchips    4.50

is split into two rows:

['cat', 'dog', 'fish']
['chips', 4.5]

What I want is a single row:

['cat', 'dog', 'fish\nchips', 4.5]

How do I make Python stop being clever and split lines only on 0x0a?

My code is:

with open(path, 'r') as file:
    for line in file:                   
        row = line.split("\t")

Quoting the TSV file is not an option, because I do not create it myself.

Answer 1

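This already works correctly; for a file containing a literal \ followed by a literal n character (two bytes), Python will never treat that sequence as a newline.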

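What you have, then, is a single \n character, an actual newline (byte 0x0a), inside the field. The rest of your file is separated by the conventional Windows \r\n line separator.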
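One way to check which case applies is to look at the raw bytes. The following is a minimal Python 3 sketch, not part of the original answer; count_line_endings and sample_size are hypothetical names chosen for illustration, and only the start of the file is sampled.

```python
import os
import tempfile


def count_line_endings(path, sample_size=1 << 20):
    """Count line-ending byte patterns in the first sample_size bytes of a file."""
    with open(path, 'rb') as fh:
        sample = fh.read(sample_size)
    crlf = sample.count(b'\r\n')
    # Every CRLF pair contains one LF, so the remainder are bare LF bytes.
    bare_lf = sample.count(b'\n') - crlf
    # Two literal bytes: a backslash followed by the letter n.
    literal_backslash_n = sample.count(b'\\n')
    return crlf, bare_lf, literal_backslash_n


# Small demo file: CRLF line endings with one bare LF inside a field per line.
fd, demo_path = tempfile.mkstemp(suffix='.tsv')
with os.fdopen(fd, 'wb') as outfh:
    outfh.write(b'cat\tdog\tfish\nchips\t4.50\r\nsnake\tegg\tspam\nham\t42.38\r\n')

counts = count_line_endings(demo_path)
os.remove(demo_path)
print(counts)  # (2, 2, 0)
```

A non-zero count of bare LF bytes alongside CRLF pairs suggests the situation described above: real newlines inside fields, with \r\n as the actual line separator.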

Use io.open() to control how newlines are handled:

import io

with io.open(path, newline='\r\n') as infh:
    for line in infh:
        row = line.strip().split('\t')

Demo (Python 2 session):

>>> import io
>>> with open('/tmp/test.txt', 'wb') as outfh:
...     outfh.write('cat\tdog\tfish\nchips\t4.50\r\nsnake\tegg\tspam\nham\t42.38\r\n')
...
>>> with io.open('/tmp/test.txt', newline='\r\n') as infh:
...     for line in infh:
...         row = line.strip().split('\t')
...         print row
... 
[u'cat', u'dog', u'fish\nchips', u'4.50']
[u'snake', u'egg', u'spam\nham', u'42.38']

Note that io.open() also decodes your file data to unicode; you may need to specify an explicit encoding for non-ASCII file data.
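On Python 3, the built-in open() is the same function as io.open() and accepts the same newline argument, so the approach might be sketched as follows. This is an illustration under stated assumptions rather than the answerer's code: read_crlf_rows is a hypothetical helper name, the UTF-8 default encoding is an assumption, and the sketch removes only the trailing '\r\n' instead of calling strip(), so that empty leading or trailing fields are preserved.

```python
import os
import tempfile


def read_crlf_rows(path, encoding='utf-8'):
    """Yield one list of fields per CRLF-terminated line, streaming the file."""
    # With newline='\r\n', only '\r\n' ends a line; a lone '\n' stays in the data.
    with open(path, encoding=encoding, newline='\r\n') as infh:
        for line in infh:
            # Remove only the terminator, which is returned untranslated.
            if line.endswith('\r\n'):
                line = line[:-2]
            yield line.split('\t')


# Demo with the same sample data as the session above.
fd, demo_path = tempfile.mkstemp(suffix='.tsv')
with os.fdopen(fd, 'wb') as outfh:
    outfh.write(b'cat\tdog\tfish\nchips\t4.50\r\nsnake\tegg\tspam\nham\t42.38\r\n')

rows = list(read_crlf_rows(demo_path))
os.remove(demo_path)
print(rows)
# [['cat', 'dog', 'fish\nchips', '4.50'], ['snake', 'egg', 'spam\nham', '42.38']]
```

Because the function is a generator that iterates over the file object, it reads one line at a time and should remain usable for files larger than memory.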

Answer 2

If your problem is with .readline() and splitting on \t, try the built-in csv module:

import csv

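with open(path, 'r') as file:
    # csv.reader is the module's factory function (the name is lowercase).
    reader = csv.reader(file, delimiter='\t')  # Or DictReader - I like DictReader.
    first_row = next(reader)  # next() works on both Python 2.6+ and Python 3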

It handles these things for us.

Python: read a file delimited strictly by 0x0a, not by the '\n' string
