python: Removing duplicate groups of text lines

Date: 2017-12-23 22:18:25

Tags: python-3.x text duplicates

I know how to remove duplicate lines and duplicate characters from text, but I'm trying to do something more complicated in Python 3. My text files may or may not contain groups of lines that are repeated within each file. I want to write a Python utility that finds these repeated groups of lines and removes all but the first occurrence found.

For example, suppose file1 contains this data:

Now is the time
for all good men
to come to the aid of their party.

This is some other stuff.

And this is even different stuff.

Now is the time
for all good men
to come to the aid of their party.

Now is the time
for all good men
to come to the aid of their party.

That's all, folks.

I want the following to be the result of that transformation:

Now is the time
for all good men
to come to the aid of their party.

This is some other stuff.

And this is even different stuff.




That's all, folks.

I also want this to work when a repeated group of lines starts somewhere other than the beginning of the file. Suppose file2 looks like this:

This is some text.

This is some other text,
as is this.

All around
the mulberry bush
the monkey chased the weasel.

Here is some more random stuff.
All around
the mulberry bush
the monkey chased the weasel.
... and this is another phrase.

All around
the mulberry bush
the monkey chased the weasel.

End

For file2, this should be the result of the transformation:

This is some text.

This is some other text,
as is this.

All around
the mulberry bush
the monkey chased the weasel.

Here is some more random stuff.
... and this is another phrase.


End

To be clear, the groups of lines that might be repeated are not necessarily known before running this desired utility. The algorithm has to identify these repeated groups of lines on its own.

I'm sure that with enough work and enough time I could eventually come up with the algorithm I'm looking for, but I'm hoping someone may have already solved this problem and published the result somewhere. I've been searching and haven't found anything, but perhaps I've overlooked something.

ADDENDUM: I need to add more clarity. The groups of lines must be maximal, and each group must contain at least 2 lines.

For example, suppose file3 looks like this:

line1 line1 line1
line2 line2 line2
line3 line3 line3

other stuff

line1 line1 line1
line3 line3 line3
line2 line2 line2

In this case, the desired algorithm would not remove any lines.

As another example, in file4:

abc def ghi
jkl mno pqr

line1 line1 line1
line2 line2 line2
line3 line3 line3
abc def ghi
line1 line1 line1
line2 line2 line2
line3 line3 line3
line4 line4 line4
qwerty
line1 line1 line1
line2 line2 line2
line3 line3 line3
line4 line4 line4
asdfghj
line1 line1 line1
line2 line2 line2
line3 line3 line3
lkjhgfd
line2 line2 line2
line3 line3 line3
line4 line4 line4
wxyz

The result I'm looking for is:

abc def ghi
jkl mno pqr

line1 line1 line1
line2 line2 line2
line3 line3 line3
abc def ghi
line1 line1 line1
line2 line2 line2
line3 line3 line3
line4 line4 line4
qwerty
asdfghj
line1 line1 line1
line2 line2 line2
line3 line3 line3
lkjhgfd
line2 line2 line2
line3 line3 line3
line4 line4 line4
wxyz

In other words, since the 4-line group ("line1 ... line2 ... line3 ... line4 ...") is the largest repeated group, it is the only group that gets removed.

I can always repeat the process until the file no longer changes if I also want smaller repeated groups removed.
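
For instance, a small fixpoint loop along the lines of this sketch would do it (dedup_until_stable and remove_largest_dup_group are hypothetical names; the second argument stands for any function that removes all but the first occurrence of the largest duplicated group):

def dedup_until_stable(text, remove_largest_dup_group):
    # Keep removing the largest duplicated group until the text stops changing.
    while True:
        new_text = remove_largest_dup_group(text)
        if new_text == text:
            return text
        text = new_text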

2 answers:

Answer 0: (score 0)

I came up with the following solution. It may still have some unhandled edge cases, and it may not be the most efficient approach, but at least after my initial testing it seems to work.

This revised version fixes some bugs that were in the version I originally posted.

Any suggestions for improvement are welcome.

# Remove all but the first occurrence of the longest                                                                            
# duplicated group of lines from a block of text.
# In this utility, a "group" of lines is considered
# to be two or more consecutive lines.                                                                             
#                                                                                                                               
# Much of this code has been shamelessly stolen from                                                                            
# https://programmingpraxis.com/2010/12/14/longest-duplicated-substring/                                                        

import sys

from itertools import starmap, takewhile, tee
from operator import eq, truth

# imap and izip no longer exist in python3 itertools.                                                                           
# These are simply equivalent to map and zip in python3.                                                                        
try:
    # python2 ...
    from itertools import imap
except ImportError:
    # python3 ...
    imap = map
try:
    # python2 ...
    from itertools import izip
except ImportError:
    # python3 ...
    izip = zip

def remove_longest_dup_line_group(text):
    if not text:
        return ''
    # Unlike in the original code, here we're dealing                                                                           
    # with groups of whole lines instead of strings                                                                              
    # (groups of characters). So we split the incoming                                                                          
    # data into a list of lines, and we then apply the                                                                          
    # algorithm to these lines, treating a line in the
    # same way that the original algorithm treats an
    # individual character.                                                                                                       
    lines = text.split('\n')
    ld = longest_duplicate(lines)
    if not ld:
        return text
    tokens = text.split(ld)
    if len(tokens) < 1:
        # Defensive programming: this shouldn't ever happen,                                                                    
        # but just in case ...                                                                                                  
        return text
    return '{}{}{}'.format(tokens[0], ld, ''.join(tokens[1:]))

def pairwise(iterable):
    a, b = tee(iterable)
    next(b, None)
    return izip(a,b)

def prefix(a, b):
    count = sum(takewhile(truth, imap(eq, a, b)))
    if count < 2:
        # Blocks must consist of more than one line.
        return ''
    else:
        return '{}\n'.format('\n'.join(a[:count]))

def longest_duplicate(s):
    if len(s) < 2:
        # A single line can't contain a duplicated group of lines,
        # and max() below would fail on an empty sequence.
        return ''
    suffixes = (s[n:] for n in range(len(s)))
    return max(starmap(prefix, pairwise(sorted(suffixes))), key=len)

if __name__ == '__main__':
    text = sys.stdin.read()
    if text:
        # Use sys.stdout.write instead of print to
        # avoid adding an extra newline at the end.
        sys.stdout.write(remove_longest_dup_line_group(text))
    sys.exit(0)
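
As a quick sanity check, the function can also be exercised directly on an in-memory string (the sample text here is made up purely for illustration, and assumes remove_longest_dup_line_group from above is in scope):

sample = ('Now is the time\n'
          'for all good men\n'
          'This is some other stuff.\n'
          'Now is the time\n'
          'for all good men\n')
# Expected: the repeated two-line group appears only once,
# followed by the "other stuff" line.
print(remove_longest_dup_line_group(sample))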

Answer 1: (score 0)

Quick and dirty, not tested for edge cases:

#!/usr/bin/env python3

from pathlib import Path

TEXT = '''Now is the time
for all good men
to come to the aid of their party.

This is some other stuff.

And this is even different stuff.

Now is the time
for all good men
to come to the aid of their party.

Now is the time
for all good men
to come to the aid of their party.

That's all, folks.'''

def remove_duplicate_blocks(lines):
    num_lines = len(lines)

    for idx_start in range(num_lines):
        idx_end = num_lines

        # Consider candidate blocks lines[idx_start + 1:idx], longest first.
        for idx in range(idx_end, -1, -1):
            if idx_start < idx:
                dup_candidate_block = lines[idx_start + 1: idx]
                len_dup_block = len(dup_candidate_block)
                if len_dup_block and len_dup_block < int(num_lines / 2):
                    # Look for an occurrence of the candidate block that
                    # starts earlier in the file.
                    for scan_idx in range(idx):
                        if ((idx_start + 1) > scan_idx
                                and dup_candidate_block == lines[scan_idx: scan_idx + len_dup_block]):
                            # Drop the later copy and restart on the shortened list.
                            lines[idx_start + 1: idx] = []
                            return remove_duplicate_blocks(lines)
    return lines


if __name__ == '__main__':
    clean_lines = remove_duplicate_blocks(TEXT.split('\n'))
    print('\n'.join(clean_lines))
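
To run this on an actual file rather than the embedded TEXT sample, something along these lines should work (the file name file1 is just a placeholder, and remove_duplicate_blocks is the function defined above):

from pathlib import Path

file_lines = Path('file1').read_text().split('\n')
print('\n'.join(remove_duplicate_blocks(file_lines)))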