If the next line of a file contains a string, append it to the end of the current line

Time: 2017-03-17 16:23:20

Tags: python

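I have a CSV with 13 million lines. The fields are not enclosed in quotes, and some of them contain newline characters, so a single row of data ends up split across two lines. A row never has more than one such break.

How can I take data like this: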

Line of data
Line of data
 continuation of previous line of data
Line of data
Line of data
 continuation of previous line
Line of data

and turn it into this:

Line of data
Line of data continuation of previous line of data
Line of data
Line of data continuation of previous line
Line of data

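I have tried storing the line in a variable and then processing the next one, checking whether its first character is anything other than 'L' and appending it if so. I have also tried using f.tell() and f.seek() to move around in the file, but I have not been able to get it to work.

3 answers:

Answer 0: (score: 3)

Assuming that every line that starts with a space should be joined to the previous line, this should work: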

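# data is the path to the CSV file
with open(data) as infile:
    previous_line = None
    for line in infile:
        if line.startswith(' ') and previous_line is not None:
            # Continuation: join it to the line that is being held back.
            previous_line = previous_line.rstrip() + line.rstrip()
        else:
            # A new row starts, so the held line is complete and can be printed.
            if previous_line is not None:
                print(previous_line.rstrip())
            previous_line = line
    if previous_line is not None:
        print(previous_line.rstrip())  # flush the last row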
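
With 13 million lines it may be more practical to write the joined rows to a new file than to print them. A minimal sketch of that idea, assuming (as above) that a continuation line always starts with a space. The function name `join_continuations` and the file names in the usage line are made up for illustration:

```python
def join_continuations(src_path, dst_path):
    """Copy src_path to dst_path, merging each line that starts with a
    space into the line before it. Returns the number of rows written."""
    rows = 0
    held = None  # the row being assembled, without its trailing newline
    with open(src_path, newline='') as src, \
         open(dst_path, 'w', newline='') as dst:
        for line in src:
            text = line.rstrip('\r\n')
            if text.startswith(' ') and held is not None:
                # Continuation: its leading space serves as the separator.
                held += text
            else:
                if held is not None:
                    dst.write(held + '\n')
                    rows += 1
                held = text
        if held is not None:  # flush the final row
            dst.write(held + '\n')
            rows += 1
    return rows
```

It would be called as `join_continuations("input.csv", "output.csv")`. Only one row is held in memory at a time.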

Answer 1: (score: 2)

Here is a cheap, reasonably efficient joiner for continuation lines.

def cont_lines(source):
    last_line = ''
    for line in source:
        if line.startswith(' '):
            # Append a continuation; drop the newline that ended last_line first.
            last_line = last_line.rstrip('\r\n') + line
        else:
            if last_line:
                yield last_line
            last_line = line
    if last_line:  # The one remaining as the source has ended.
        yield last_line

Use it like this:

with open("tile.csv") as f:
    for line in cont_lines(f):
        # do something with line, for example print it
        print(line, end='')

It only uses as much memory as the longest run of continuation lines in the file.
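
To check the generator idea against the sample from the question without touching a real file, `io.StringIO` can stand in for the open file. The sketch below uses a variant, here called `joined_lines` (a hypothetical name), that strips the newline before appending a continuation and yields rows without line endings:

```python
import io

def joined_lines(source):
    """Yield rows from source, gluing each line that starts with a space
    onto the row before it. Rows are yielded without newlines."""
    row = None
    for line in source:
        line = line.rstrip('\r\n')
        if line.startswith(' ') and row is not None:
            row += line  # the leading space keeps the two parts apart
        else:
            if row is not None:
                yield row
            row = line
    if row is not None:  # last row once the source is exhausted
        yield row

# The sample data from the question.
sample = io.StringIO(
    "Line of data\n"
    "Line of data\n"
    " continuation of previous line of data\n"
    "Line of data\n"
    "Line of data\n"
    " continuation of previous line\n"
    "Line of data\n"
)

result = list(joined_lines(sample))
for row in result:
    print(row)
```

This prints the five joined rows shown in the question.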

Answer 2: (score: 0)

I was able to work something out.

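infile = "test.txt"

def peek_line(f):
    """Return the next line without advancing the file position."""
    pos = f.tell()
    line = f.readline()
    f.seek(pos)
    return line

with open(infile, 'r') as f:
    while True:
        line = f.readline()
        if not line:
            break
        peek = peek_line(f)
        # A next line that does not start with 'T' is treated as a continuation.
        # The check on peek avoids stripping the newline of the very last line.
        if peek and not peek.startswith('T'):
            line = line.strip() + f.readline()
        print(line, end='')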

I am open to feedback on this approach.
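
One point that may matter for this approach: it relies on `readline()` rather than `for line in f`. In Python 3, a text-mode file refuses `tell()` while it is being iterated, which is one possible reason an attempt that mixes a `for` loop with f.tell() and f.seek() does not work. A small sketch of that behaviour, using a throwaway temporary file (the helper name `tell_works_while_iterating` is made up for illustration):

```python
import os
import tempfile

def tell_works_while_iterating(path):
    """Return True if f.tell() succeeds inside `for line in f`."""
    with open(path) as f:
        for _line in f:
            try:
                f.tell()
            except OSError:
                # Python 3 text files disable tell() during iteration.
                return False
            return True
    return True  # empty file: the loop body never ran

# Build a two-line file to try it on, then clean up.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as out:
    out.write("Line of data\n continuation of previous line\n")

works = tell_works_while_iterating(path)
print(works)
os.remove(path)
```

On Python 3 this is expected to report that `tell()` does not work inside the loop, whereas the `readline()`-based loop above can call it freely.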