Question

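This is a file in which many lines are substrings of other lines. How can I filter it so that it contains only the longest version of each line?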

buffer not
buffer not available
code 000001
error pxa_no_shared_memory
error pxa_no_shared_memory occurred
error pxa_no_shared_memory occurred short
error pxa_no_shared_memory occurred short dump
failed return
failed return code
failed return code 000001
for pxa
for pxa buffer
for pxa buffer not
for pxa buffer not available
initialization runt
initialization runt failed
initialization runt failed return
initialization runt failed return code
initialization runt failed return code 000001
memory for
memory for pxa
memory for pxa buffer
memory for pxa buffer not
memory for pxa buffer not available
not available
occurred short
occurred short dump

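If a phrase appears inside a longer phrase, for example "buffer not available" also appears in "memory for pxa buffer not available", I want to keep only "memory for pxa buffer not available".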

The output should be a text file containing all of the longest error messages. Like this:

error pxa_no_shared_memory occurred short dump
initialization runt failed return code 000001
memory for pxa buffer not available

Answer 1

Not sure about the efficiency, but:

with open('lines.txt') as f:
    original = f.read().splitlines()
    results = set(original)
    for o in original:
        for r in set(results):
            if o != r:
                try:
                    if o in r:
                        results.remove(o)
                    elif r in o:
                        results.remove(r)
                except KeyError:
                    pass

print('\n'.join(results))
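
Because the result above is a set, the output order can vary between runs. As a rough sketch of a deterministic variant (the function name `longest_lines` and the sample list are made up for illustration, not part of the original answer), the lines can be visited longest first, keeping a line only when no already-kept line contains it:

```python
def longest_lines(lines):
    """Return the lines that are not substrings of any other line, sorted."""
    kept = []
    # Visit the longest lines first: a line can only be contained
    # in a line that is at least as long as itself.
    for line in sorted(set(lines), key=len, reverse=True):
        if not any(line in k for k in kept):
            kept.append(line)
    return sorted(kept)


# Hypothetical sample taken from the question's data.
sample = [
    "buffer not",
    "buffer not available",
    "memory for pxa buffer not available",
    "not available",
]
print("\n".join(longest_lines(sample)))
```

This is still quadratic in the number of lines in the worst case, so it is a readability change rather than a performance one.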

Answer 2

How about this:

phrases = '''buffer not
buffer not available
code 000001
error pxa_no_shared_memory
error pxa_no_shared_memory occurred
error pxa_no_shared_memory occurred short
error pxa_no_shared_memory occurred short dump
failed return
failed return code
failed return code 000001
for pxa
for pxa buffer
for pxa buffer not
for pxa buffer not available
initialization runt
initialization runt failed
initialization runt failed return
initialization runt failed return code
initialization runt failed return code 000001
memory for
memory for pxa
memory for pxa buffer
memory for pxa buffer not
memory for pxa buffer not available
not available
occurred short
occurred short dump'''.split("\n")

# We want to keep only the longest version of each line.

results = []
for phrase in phrases:
    # if a phrase we already kept contains this one, there is nothing to do
    if any(phrase in r for r in results):
        continue
    # otherwise drop every kept phrase that the new phrase contains.
    # This has to be a substring test rather than startswith():
    # "buffer not available" is not a prefix of
    # "memory for pxa buffer not available", but it is contained in it.
    results = [r for r in results if r not in phrase]
    results.append(phrase)

print('\n'.join(results))

Not exactly elegant, but it gets the job done.
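
To check the output of either approach, something like the following could be used. This is only a sketch; the helper name `is_longest_only` and the small test lists are assumptions made for illustration. It tests the two properties the question asks for: no remaining line sits inside another remaining line, and every original line is contained in some remaining line.

```python
def is_longest_only(results, original):
    """Check that results holds exactly the 'longest versions' of original."""
    # No result may be a substring of a different result entry.
    no_overlap = all(
        i == j or a not in b
        for i, a in enumerate(results)
        for j, b in enumerate(results)
    )
    # Every original line must be covered by at least one result.
    covered = all(any(o in r for r in results) for o in original)
    return no_overlap and covered


expected = [
    "error pxa_no_shared_memory occurred short dump",
    "initialization runt failed return code 000001",
    "memory for pxa buffer not available",
]
# A few of the shorter lines from the question, plus the expected output.
original = expected + ["buffer not", "not available", "occurred short dump"]

print(is_longest_only(expected, original))
print(is_longest_only(original, original))
```

The first call checks the expected output from the question; the second passes the unfiltered list, which still contains lines such as "buffer not" that sit inside longer ones.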

How to remove lines from a file that are substrings of other lines

2 answers: