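I have a CSV file with 13 million rows. The fields are not enclosed in quotes, and some of them contain newline characters, so a single record can end up split across two lines. No record is broken more than once.
How can I take data like this: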
Line of data
Line of data
continuation of previous line of data
Line of data
Line of data
continuation of previous line
Line of data
and turn it into this:
Line of data
Line of data continuation of previous line of data
Line of data
Line of data continuation of previous line
Line of data
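I have tried storing the current line in a variable and then processing the next one, checking whether its first character is anything other than 'L' and appending it if so. I also tried using f.tell() and f.seek() to move around in the file, but I have not been able to get that to work.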
Answer 0: (score: 3)
Assuming that every line that starts with a space should be joined to the previous line, this should work:
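with open(data) as infile:
    previous_line = None
    for line in infile:
        line = line.rstrip('\n')
        if line.startswith(' ') and previous_line is not None:
            # Continuation: join it to the buffered line with a single space.
            previous_line = previous_line + ' ' + line.strip()
        else:
            # A new record starts, so the buffered one is complete.
            if previous_line is not None:
                print(previous_line)
            previous_line = line
    # Flush the last buffered record.
    if previous_line is not None:
        print(previous_line)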
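To try the joining logic without a 13-million-row file, the same idea can be run over a few lines held in memory. This is only a sketch: the function name `join_continuations` is made up for illustration, and the sample assumes, as this answer does, that continuation lines begin with a space.

```python
import io

def join_continuations(lines):
    """Yield records, joining any line that starts with a space onto the previous one."""
    previous = None
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith(' ') and previous is not None:
            # Continuation line: append it to the buffered record.
            previous = previous + ' ' + line.strip()
        else:
            # A new record begins, so emit the buffered one first.
            if previous is not None:
                yield previous
            previous = line
    if previous is not None:
        yield previous

# Sample shaped like the question's data (leading space is an assumption).
sample = io.StringIO(
    'Line of data\n'
    'Line of data\n'
    ' continuation of previous line of data\n'
    'Line of data\n'
)
result = list(join_continuations(sample))
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the `io.StringIO` sample.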
Answer 1: (score: 2)
Here is a cheap, reasonably efficient continuation-line joiner.
def cont_lines(source):
    last_line = ''
    for line in source:
        if line.startswith(' '):
            # Append a continuation, replacing the embedded newline with a space.
            last_line = last_line.rstrip('\n') + ' ' + line.lstrip()
        else:
            if last_line:
                yield last_line
            last_line = line
    if last_line:  # The one remaining as the source has ended.
        yield last_line
Use it like this:
with open("tile.csv") as f:
    for line in cont_lines(f):
        # do something with line
        pass
It only uses as much memory as the longest run of continuation lines in the file.
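The memory claim can be illustrated by feeding the generator a source that never ends and taking only the first few records. The sketch below restates the generator so that it runs on its own (this copy replaces the embedded newline with a single space); `endless_source` is a hypothetical stand-in for a very large file.

```python
import itertools

def cont_lines(source):
    """Join lines that start with a space onto the preceding line."""
    last_line = ''
    for line in source:
        if line.startswith(' '):
            last_line = last_line.rstrip('\n') + ' ' + line.lstrip()
        else:
            if last_line:
                yield last_line
            last_line = line
    if last_line:
        yield last_line

def endless_source():
    """Hypothetical stand-in for a huge file: yields lines forever."""
    for i in itertools.count():
        yield 'record %d\n' % i
        if i % 2:
            # Every odd-numbered record is followed by a continuation line.
            yield ' continued\n'

# Only as many input lines are read as are needed to produce three records.
first_three = list(itertools.islice(cont_lines(endless_source()), 3))
```

If the generator buffered the whole input, this call would never return; it finishes because only one pending record is held at a time.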
Answer 2: (score: 0)
I was able to work something out.
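infile = "test.txt"

def peek_line(f):
    # Read the next line, then rewind so the caller can read it again.
    pos = f.tell()
    line = f.readline()
    f.seek(pos)
    return line

with open(infile, 'r') as f:
    while True:
        line = f.readline()
        if not line:
            break
        peek = peek_line(f)
        # In this test file a new record starts with 'T'; anything else is a continuation.
        if peek and not peek.startswith('T'):
            line = line.strip() + f.readline()
        print(line, end='')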
I am open to feedback on this approach.
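One way to sanity-check the peek helper in isolation is to run it against an in-memory buffer and confirm that the read position is restored afterwards. This is a minimal sketch; the sample text is invented, and `io.StringIO` stands in for a real file.

```python
import io

def peek_line(f):
    """Return the next line without consuming it."""
    pos = f.tell()
    line = f.readline()
    f.seek(pos)
    return line

buf = io.StringIO('Two\n words\nThree\n')
first = buf.readline()        # consumes 'Two\n'
peeked = peek_line(buf)       # looks at ' words\n' but rewinds afterwards
after_peek = buf.readline()   # reads the same line again
```

Note that this relies on reading with `readline()`; iterating over a text file with a `for` loop disables `tell()` in Python 3.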