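Parsing a file to collect section headers together with the content that follows them

Time: 2014-06-19 19:30:27

Tags: python debugging python-2.x text-processing

I need to merge the lines of an input file (the uppercase ones) into a single line per section, as shown below:

file1 (input)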

=4455
AAAAAAAAAA
BBBBBBBBBBB
CCCCCCCCCC
=3433
GGGGGGGGGGGG
DDDDDDDDDDD
EEEEEEEEEEE
=44543
FFFFFFFFFFFFF
HHHHHHHHHHHHH

Expected output

=4455
AAAAAAAAAABBBBBBBBBBB
CCCCCCCCCC
=3433
GGGGGGGGGGGGDDDDDDDDDDDEEEEEEEEEEE
=44543
FFFFFFFFFFFFFHHHHHHHHHHHHH

My code

fp=open("file1","r")
a=[]
for line in fp:
    if line[0]=="=":
        print line.strip()
        print "".join(a)
        a=[]
    else:
        a.append(line.strip())

Actual output

=4455

=3433
AAAAAAAAAABBBBBBBBBBB
CCCCCCCCCC
=44543
GGGGGGGGGGGGDDDDDDDDDDDEEEEEEEEEEE

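I know this is a silly mistake, but could someone help me find the problem in my code?

4 answers:

Answer 0 (score: 3)

Your problem is that you print line.strip() before "".join(a), rather than after it. Each header is therefore printed ahead of the body accumulated for the previous section, and the body of the last section is never printed at all. Corrected version: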

a = []
fp=open("file1","r")
for line in fp:
    if line[0]=="=":
        if a:  #  prevent printing a blank line at the start
            print "".join(a)
        print line.strip()
        a=[]
    else:
        a.append(line.strip())
print "".join(a)

(Initialize a before the loop, and print the final contents of a once the loop has finished.)
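The same fix can be sketched as a generator that accepts any iterable of lines, which makes it easy to check without a file on disk. The name merge_sections and the sample list are hypothetical, and skipping blank lines is an added assumption that is not part of the code above. It is written to run on Python 3 as well.

```python
def merge_sections(lines):
    """Yield each '=' header, then its body lines joined into one line."""
    body = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no content
        if line.startswith("="):
            if body:  # flush the previous section before the new header
                yield "".join(body)
            yield line
            body = []
        else:
            body.append(line)
    if body:  # flush the last section, which no header follows
        yield "".join(body)


# Hypothetical sample standing in for the contents of file1.
sample = ["=4455\n", "AAAAAAAAAA\n", "BBBBBBBBBBB\n", "=3433\n", "GGGGGGGGGGGG\n"]
result = list(merge_sections(sample))
print("\n".join(result))
```

To use it on the real file, pass the open file object in place of sample, since iterating over a file yields its lines.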

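Answer 1 (score: 3)

Instead of printing inside the loop, accumulate everything that needs to be printed and output it at the end. When you see a header line, append it and start accumulating the lines that follow. When you see the next header, append the joined lines and then the new header, and so on.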

with open('file1') as f:
    lines = f.read().splitlines()

out = []  # will accumulate lines to be output
items = []  # will accumulate lines between headers

for line in lines:
    line = line.strip()

    if not line:  # ignore blank lines
        continue

    if line.startswith('='): # new header, join the accumulated items
        if items:  # don't add a blank line if no lines were accumulated
            out.append(''.join(items))

        out.append(line)  # accumulate new header
        items = []

        continue

    items.append(line)  # accumulate non-header lines

if items:  # handle last accumulated items
    out.append(''.join(items))

print '\n'.join(out)  # out is now a list of header, joined lines, header...
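As a variation on the same accumulate-then-output idea, the grouping can also be delegated to itertools.groupby from the standard library. This is a different technique from the loop above, not a restatement of it, and the function name merge_with_groupby is hypothetical. A minimal sketch that runs on Python 3 as well:

```python
from itertools import groupby


def merge_with_groupby(lines):
    """Return headers and joined bodies, in the order they appear."""
    stripped = (line.strip() for line in lines)
    non_blank = (line for line in stripped if line)  # ignore blank lines
    out = []
    # groupby batches consecutive lines that share the same key value.
    for is_header, group in groupby(non_blank, key=lambda s: s.startswith("=")):
        if is_header:
            out.extend(group)  # consecutive headers stay on separate lines
        else:
            out.append("".join(group))  # a run of body lines becomes one line
    return out


# Hypothetical sample input.
merged = merge_with_groupby(["=1", "AA", "BB", "", "=2", "=3", "CC"])
print("\n".join(merged))
```

Because groupby only looks at runs of adjacent lines, no explicit flush of the last section is needed.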

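Answer 2 (score: 3)

An alternative that may be easier to read and maintain if the logic grows more complex: build a dict during the for loop, then print it (or apply whatever other logic you need):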

fp=open("file1","r")
mydict = {}

for line in fp:
    if line[0]=="=":
        key = line.strip()
    else:
        mydict.setdefault(key,[]).append(line.strip())

for key, value in mydict.iteritems():
    print key
    print "".join(value)

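Worth noting: this approach can change the order of the sections in the output, because a standard Python 2 dict does not guarantee key order (plain dicts only guarantee insertion order from Python 3.7 onward). If you are on Python 2.7 or later, you can use collections.OrderedDict instead. It preserves the order in which keys were first inserted and is a subclass of dict, so it can be swapped in seamlessly.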
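A minimal sketch of the OrderedDict variant follows. The function name collect_sections and the sample list are hypothetical, and ignoring lines that appear before the first header is an added assumption. Note that sections with identical headers are merged under one key, as with the plain dict. It is written to run on Python 3 as well.

```python
from collections import OrderedDict


def collect_sections(lines):
    """Map each '=' header to the list of body lines that follow it."""
    sections = OrderedDict()  # remembers the order headers were first seen
    key = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith("="):
            key = line
            sections.setdefault(key, [])  # keep headers that have no body
        elif key is not None:  # assumption: drop lines before the first header
            sections[key].append(line)
    return sections


# Hypothetical sample standing in for the contents of file1.
sections = collect_sections(["=4455", "AAAAAAAAAA", "BBBBBBBBBBB", "=3433", "GGGGGGGGGGGG"])
for header, body in sections.items():
    print(header)
    print("".join(body))
```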

Answer 3 (score: 0)

TXR

@(repeat)
=@blah
@  (collect)
@lines
@  (until)
=@/.*/
@  (end)
@  (cat lines "")
@  (output)
=@blah
@lines
@  (end)
@(end)

Run:

$ txr data.txr data
=4455
AAAAAAAAAABBBBBBBBBBBCCCCCCCCCC
=3433
GGGGGGGGGGGGDDDDDDDDDDDEEEEEEEEEEE
=44543
FFFFFFFFFFFFFHHHHHHHHHHHHH

TXR Lisp

$ txr -t '[mapcar cat-str (partition-by (opip first (= #\=)) (get-lines))]' < data
=4455
AAAAAAAAAABBBBBBBBBBBCCCCCCCCCC
=3433
GGGGGGGGGGGGDDDDDDDDDDDEEEEEEEEEEE
=44543
FFFFFFFFFFFFFHHHHHHHHHHHHH

In awk:

/^=/    { printf("%s", out);
          blah = $0; line = ""; next }
        { line = line $0
          out = blah "\n" line "\n" }
END     { printf("%s", out); }

Run:

$ awk -f data.awk data
=4455
AAAAAAAAAABBBBBBBBBBBCCCCCCCCCC
=3433
GGGGGGGGGGGGDDDDDDDDDDDEEEEEEEEEEE
=44543
FFFFFFFFFFFFFHHHHHHHHHHHHH