How do I clean up a text file in Python?

Asked: 2011-04-26 18:02:36

Tags: python text

The text in my file looks like this:

text1 5,000 6,000
text2 2,000 3,000
text3 
           5,000 3,000
text4 1,000 2000
text5
          7,000 1,000
text6 2,000 1,000

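Is there a way to clean this up in Python so that, when a text line is missing its numbers, the numbers on the following line are moved up onto that line: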

text1 5,000 6,000
text2 2,000 3,000
text3 5,000 3,000
text4 1,000 2000
text5 7,000 1,000
text6 2,000 1,000

Thanks!

2 answers:

Answer 0 (score: 3)

Assuming every line should contain exactly three "words", you can use:

with open("file") as f:
    tokens = (x for line in f for x in line.split())
    # zip pulls three consecutive tokens from the same generator for each row
    for t in zip(tokens, tokens, tokens):
        print(" ".join(t))
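To see why passing the same generator to `zip` three times regroups the tokens, here is a small self-contained sketch in Python 3. The function name `group_in_threes` and the in-memory sample are made up for illustration; the sketch assumes every record really has exactly three tokens.

```python
import io

def group_in_threes(lines):
    """Yield one joined row for every three whitespace-separated tokens."""
    # One generator is shared by all three zip arguments, so each output
    # tuple consumes three consecutive tokens, regardless of line breaks.
    tokens = (tok for line in lines for tok in line.split())
    for row in zip(tokens, tokens, tokens):
        yield " ".join(row)

# Hypothetical sample standing in for the file from the question.
sample = io.StringIO(
    "text1 5,000 6,000\n"
    "text3 \n"
    "           5,000 3,000\n"
)
rows = list(group_in_threes(sample))
for row in rows:
    print(row)
# text1 5,000 6,000
# text3 5,000 3,000
```

Note that `zip` stops at the shortest input, so a trailing group with fewer than three tokens is silently dropped.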

Edit: since the prerequisite above apparently does not hold, here is an implementation that actually looks at the data:

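from itertools import groupby

with open("file") as f:
    tokens = (x for line in f for x in line.split())
    # Consecutive tokens are grouped by whether they start with a digit.
    for key, it in groupby(tokens, lambda x: x[0].isdigit()):
        if key:
            # A run of numbers completes the current output line.
            print(" ".join(it))
        else:
            # Labels: stay on the same line so the numbers can follow.
            print("\n".join(it), end=" ")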
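The same idea can be wrapped in a function that returns the rows instead of printing them, which makes it easier to check. This is only a sketch in Python 3: the name `merge_numbers` is made up, and it assumes that labels never start with a digit while numbers always do.

```python
from itertools import groupby

def merge_numbers(lines):
    """Return rows in which each run of numbers is attached to the label before it."""
    tokens = (tok for line in lines for tok in line.split())
    rows = []
    # groupby yields alternating runs of "number" and "label" tokens.
    for is_number, group in groupby(tokens, key=lambda tok: tok[0].isdigit()):
        if is_number:
            numbers = " ".join(group)
            if rows:
                rows[-1] += " " + numbers   # attach to the most recent label
            else:
                rows.append(numbers)        # numbers before any label (assumed rare)
        else:
            rows.extend(group)              # each label starts its own row
    return rows

sample = [
    "text1 5,000 6,000",
    "text5",
    "          7,000 1,000",
    "text7",
    "text8 1 2 3",
]
result = merge_numbers(sample)
print("\n".join(result))
# text1 5,000 6,000
# text5 7,000 1,000
# text7
# text8 1 2 3
```

Unlike the three-token version, this handles any number of values per label, including none.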

Answer 1 (score: 1)

Assuming a logical line is "continued" on lines that begin with whitespace (and may contain any number of records), you can use:

>>> collapse_space = lambda s: " ".join(s.split())
>>>
>>> logical_lines = []
>>> for line in open("text"):
...   if line[0].isspace():
...     logical_lines[-1] += line  # append the continuation to the last logical line
...   else:
...     logical_lines.append(line)  # start a new logical line
...
>>> l = map(collapse_space, logical_lines)
>>>
>>> print("\n".join(l))
text1 5,000 6,000
text2 2,000 3,000
text3 5,000 3,000
text4 1,000 2000
text5 7,000 1,000
text6 2,000 1,000
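The session above raises an IndexError if the very first line of the file is blank or indented, because there is no previous logical line to extend yet. A slightly more defensive sketch of the same approach in Python 3 follows; the name `join_continuations` is made up, and skipping blank lines is an assumption about what is wanted, not something the question specifies.

```python
def join_continuations(lines):
    """Merge lines that start with whitespace into the preceding line."""
    logical = []
    for line in lines:
        if not line.strip():
            continue  # assumption: blank lines carry no data and are skipped
        if line[0].isspace() and logical:
            logical[-1] += " " + line  # continuation of the last logical line
        else:
            logical.append(line)       # start a new logical line
    # Collapse every run of whitespace (including newlines) to one space.
    return [" ".join(s.split()) for s in logical]

sample = [
    "text3 \n",
    "           5,000 3,000\n",
    "\n",
    "text4 1,000 2000\n",
]
cleaned = join_continuations(sample)
print("\n".join(cleaned))
# text3 5,000 3,000
# text4 1,000 2000
```

Because the function accepts any iterable of lines, an open file object can be passed in directly.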