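Near-identical duplicates of different lengths

Time: 2018-05-21 11:56:51

Tags: python text duplicates

I want to remove near-identical duplicates but keep only the longest one. My idea is to first compare the first word (or the first few words) to filter out the candidates for comparison, and then compare the lengths of the remaining candidates. If a line is the longest, I write it to a new text file. Here is the test file: https://drive.google.com/file/d/1tdewlNtIqBMaldgrUr02kbCKDyndXbSQ/view?usp=sharing

Input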

I am Harry.
I am Harry. I like 
I am Harry. I like to eat apple.
I am Garry.
I am Garry. I am Hap
I am Garry. I am Happy.

Output

I am Harry. I like to eat apple.
I am Garry. I am Happy.

I am doing this in Python, but it just won't work.

Code

f1 = open('a.txt','r') # Read from file
ListofLine = f1.readlines() # Read the line into list
f2 = open('n.txt','w') # Open new file to write

# Iterate all the sentences to compare
for x in len(ListofLine):
    # Comparing first word of the sentences
    if(ListofLine[x].split()[0] = ListofLine[x+1].split()[0]):
        # Comparing the length and keep the longest length sentences
        if(len(ListofLine[x])>len(ListofLine[x+1])):
            f2.write(ListofLine[x])

f1.close()   
f2.close()

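3 Answers:

Answer 0: (score: 1)

You need to define a criterion for finding what you call the common part. It could be the first sentence, for example "I am Harry."

To parse the sentences, you can use a regular expression, for example: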

import re


# match a sentence ending with a dot (the dot is optional at the end of the text)
re_sentence = r'((?:(?!\.|$).)+\.?)\s*'
find_all_sentences = re.compile(re_sentence, flags=re.DOTALL).findall

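Here find_all_sentences is a function: it is the findall method of the pattern returned by re.compile. It is a helper that finds all the sentences in a line.

Once this function is defined, you can use it to parse each line and extract the first sentence, which is treated as the common part to check.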
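As a quick check of what the helper returns, here is a minimal sketch that applies the same pattern to one of the sample lines. The variable names `line` and `sentences` are only for illustration.

```python
import re

# Same pattern as above: a run of non-dot characters, an optional closing dot,
# then trailing whitespace (which stays outside the captured group).
re_sentence = r'((?:(?!\.|$).)+\.?)\s*'
find_all_sentences = re.compile(re_sentence, flags=re.DOTALL).findall

line = "I am Harry. I like to eat apple."
sentences = find_all_sentences(line)
print(sentences)     # ['I am Harry.', 'I like to eat apple.']
print(sentences[0])  # I am Harry.

# An empty line yields no sentences, so indexing [0] would raise IndexError.
print(find_all_sentences(""))  # []
```

Note that blank lines in the input would need to be skipped before taking the first sentence.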

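Each time you match a first sentence, you can store the line in a dict keyed by that sentence (here I use an OrderedDict to keep the order of the lines). And if you later find a longer line with the same first sentence, you replace the stored line with it: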

import collections

lines = [
    "I am Harry. I like to eat apple",
    "I am Harry.",
    "I am Garry.",
    "I am Garry. I am Happy."]

longuest = collections.OrderedDict()
for line in lines:
    sentences = find_all_sentences(line)
    first = sentences[0]
    if first in longuest:
        longuest[first] = max([longuest[first], line], key=len)
    else:
        longuest[first] = line

Finally, you can serialize the result to a file, or simply print it:

for line in longuest.values():
    print(line)

To write the file, use a with statement:

import io


out_path = 'path/to/sentences.txt'

with io.open(out_path, mode='w', encoding='utf-8') as f:
    for line in longuest.values():
        print(line, file=f)
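Putting the pieces together, the steps above could be sketched end to end roughly as follows. This is only one possible arrangement, assuming Python 3.7+ (where a plain dict keeps insertion order, so OrderedDict is not needed). The helper names `longest_by_first_sentence` and `dedupe_file` are hypothetical, and stripping each line (which also drops trailing spaces) is an assumption, not part of the original answer.

```python
import re

find_all_sentences = re.compile(r'((?:(?!\.|$).)+\.?)\s*', flags=re.DOTALL).findall


def longest_by_first_sentence(lines):
    """For each distinct first sentence, keep the longest line (hypothetical helper)."""
    longest = {}  # plain dicts keep insertion order on Python 3.7+
    for raw in lines:
        line = raw.strip()  # assumption: surrounding whitespace is not significant
        sentences = find_all_sentences(line)
        if not sentences:
            continue  # blank lines (or lines with only dots) have no first sentence
        first = sentences[0]
        if first not in longest or len(line) > len(longest[first]):
            longest[first] = line
    return list(longest.values())


def dedupe_file(in_path, out_path):
    """Read in_path and write the surviving lines to out_path (hypothetical helper)."""
    with open(in_path, encoding='utf-8') as src:
        kept = longest_by_first_sentence(src)
    with open(out_path, mode='w', encoding='utf-8') as dst:
        for line in kept:
            print(line, file=dst)


sample = [
    "I am Harry.\n",
    "I am Harry. I like \n",
    "I am Harry. I like to eat apple.\n",
    "\n",
    "I am Garry.\n",
    "I am Garry. I am Hap\n",
    "I am Garry. I am Happy.\n",
]
print(longest_by_first_sentence(sample))
# ['I am Harry. I like to eat apple.', 'I am Garry. I am Happy.']
```

With the file names from the question, the call would be `dedupe_file('a.txt', 'n.txt')`.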

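Answer 1: (score: 0)

Here is my best attempt:

The trick is not to compute the full length of the new string (or line), but to use startswith() with the strings already checked as the prefix to match. This way you stop as soon as you reach a line that extends a previous line by at least one character, and that is all there is to it.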

ListofLine = ["I am Harry.",
              "I am Harry. I like to eat apple.",
              "I am Garry.",
              "I am Garry. I am Happy."]
longest = []  # holds the longest lines so far (not named `list`, which would shadow the builtin)

for line in ListofLine:  # ListofLine holds the input lines
    found = False
    for k in longest:
        if line.startswith(k):
            longest.remove(k)     # drop the shorter line
            longest.append(line)  # keep the longer one instead
            found = True
            break
    if not found:
        longest.append(line)
for item in longest:
    print(item)  # print() as a function, so this also runs on Python 3

In the end, the list contains the longest items.

https://www.jdoodle.com/embed/v0/vIG

Prints:

I am Harry. I like to eat apple.
I am Garry. I am Happy.
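One caveat: the loop above relies on each shorter line appearing before the longer line that extends it. If the input may be in any order, a possible variation (a sketch only, not the answerer's method) is to visit the lines from longest to shortest and drop any line that is a prefix of one already kept. The function name `keep_longest_prefix_chains` is hypothetical, and the lines are assumed to have had their trailing newlines stripped already.

```python
def keep_longest_prefix_chains(lines):
    """Drop every line that is a prefix of a longer (or identical) kept line.

    Hypothetical helper; assumes trailing newlines were stripped beforehand.
    """
    kept = []
    # Visit the longest lines first so a line's extension is always seen before it.
    for line in sorted(lines, key=len, reverse=True):
        if not any(longer.startswith(line) for longer in kept):
            kept.append(line)
    # Restore the order in which the surviving lines appeared in the input.
    position = {line: i for i, line in enumerate(lines)}
    return sorted(kept, key=position.__getitem__)


shuffled = [
    "I am Harry. I like to eat apple.",
    "I am Harry.",
    "I am Garry. I am Happy.",
    "I am Garry.",
]
for item in keep_longest_prefix_chains(shuffled):
    print(item)
# I am Harry. I like to eat apple.
# I am Garry. I am Happy.
```

This compares every line against every kept line, so it is quadratic in the worst case, which should be acceptable for small files.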

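Answer 2: (score: 0)

If you can define a function that maps each line to the class it belongs to, you can use itertools.groupby.

For example, suppose two strings count as similar if they share the same first 10 characters.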

data = """I am Harry.
I am Harry. I like
I am Harry. I like to eat apple.
I am Garry.
I am Garry. I am Hap
I am Garry. I am Happy.""".split('\n')

from itertools import groupby
criterion = lambda s: s[:10]

result = [max(g[1], key=len) for g in groupby(data, criterion)]
# ['I am Harry. I like to eat apple.', 'I am Garry. I am Happy.']
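Keep in mind that itertools.groupby only merges consecutive items with the same key, so the snippet above depends on similar lines being adjacent. When they may be interleaved, one alternative is to swap groupby for a plain dict keyed by the same criterion. The sketch below assumes Python 3.7+ for dict ordering; the name `longest_per_prefix` and its `prefix_len` parameter are hypothetical.

```python
from itertools import groupby


def longest_per_prefix(lines, prefix_len=10):
    """Keep the longest line for each distinct prefix (hypothetical helper)."""
    best = {}  # prefix -> longest line seen so far, in first-seen order
    for line in lines:
        key = line[:prefix_len]
        if key not in best or len(line) > len(best[key]):
            best[key] = line
    return list(best.values())


interleaved = [
    "I am Harry.",
    "I am Garry.",
    "I am Harry. I like to eat apple.",
    "I am Garry. I am Happy.",
]

# groupby sees four separate runs here, so nothing gets merged.
by_runs = [max(g, key=len) for _, g in groupby(interleaved, key=lambda s: s[:10])]
print(len(by_runs))  # 4

print(longest_per_prefix(interleaved))
# ['I am Harry. I like to eat apple.', 'I am Garry. I am Happy.']
```

Sorting the data by the criterion before calling groupby would also work, at the cost of losing the original line order.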