Filter a text file for similar lines

Date: 2017-07-21 09:29:46

Tags: python text-processing similarity python-textprocessing

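In a text file containing a large number of lines, I need to extract every line that begins with the same word as at least one other line, that is, every line whose first word is not unique. Such lines may have identical content (duplicate lines) or content that differs after the first word. I hope the example makes this clear. Here is a sample of such a file: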

hungarian-american
hungarian-german lied
hungarian-german
hungarian-speaking areas
hungarian-speaking regions
hungarica
hungary
hungary and slovakia
hungary and slovakia
hungry i
hunnis, william
hunt, l.

These are the lines I am looking for:

hungarian-american
hungarian-german lied
hungarian-german
hungarian-speaking areas
hungarian-speaking regions
hungary
hungary and slovakia
hungary and slovakia

These lines are discarded in this example:

hungarica
hungry i
hunnis, william
hunt, l.

because they are unique (no other line starts with the same word).

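How can I approach this problem? I am somewhat familiar with Python and regular expressions, but maybe there is a simpler solution? Thanks for your help!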

1 answer:

Answer 0 (score: 1)

This should do the trick:

import re
from collections import defaultdict

dic = defaultdict(list)

lines = """hungarian-american
hungarian-german lied
hungarian-german
hungarian-speaking areas
hungarian-speaking regions
hungarica
hungary
hungary and slovakia
hungary and slovakia
hungry i
hunnis, william
hunt, l.""".split('\n')

for line in lines:
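    # Preferably use a word tokenizer such as the ones available in NLTK,
    # but this line gives the idea: take everything before the first
    # comma, semicolon, hyphen, or whitespace character.
    # re.split always returns at least one element, so indexing [0] is safe.
    first_word = re.split(r'[,;\-\s]', line)[0]
    if not first_word:
        continue  # skip blank lines and lines that start with a delimiter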
    # Grouping similar lines
    dic[first_word].append(line)

# Show only the groups that contain more than one line
for word, lst in dic.items():
    if len(lst) > 1:
        print('\n'.join(lst))
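
The same grouping idea can also be wrapped in a small reusable function. The following is only a minimal sketch under a few assumptions: the names `first_word` and `similar_lines` are made up for illustration, the delimiter set is copied from the regular expression above, and blank lines are simply dropped. Instead of printing group by group, it counts first words and then keeps the matching lines in their original order.

```python
import re
from collections import Counter

# Same delimiters as in the answer above: comma, semicolon, hyphen, whitespace.
_DELIMITERS = re.compile(r"[,;\-\s]")


def first_word(line):
    """Return the text before the first delimiter (hypothetical helper)."""
    return _DELIMITERS.split(line.strip(), maxsplit=1)[0]


def similar_lines(lines):
    """Return, in input order, the lines whose first word occurs on more than one line."""
    # Drop trailing newlines and skip blank lines (an assumption of this sketch).
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    counts = Counter(first_word(line) for line in cleaned)
    return [line for line in cleaned if counts[first_word(line)] > 1]


if __name__ == "__main__":
    sample = [
        "hungarian-american",
        "hungarian-german lied",
        "hungarian-german",
        "hungarica",
        "hungary",
        "hungary and slovakia",
        "hungry i",
    ]
    print("\n".join(similar_lines(sample)))
```

Because `similar_lines` only iterates over its argument, an open file object should work as input as well as a list of strings, so a large file does not have to be pasted into the script.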