Question: Removing duplicates from a text file

Time: 2016-11-26 15:45:31

Tags: python python-3.5

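My goal is to remove duplicate words from the paragraphs of a (Unicode) text file and produce a word list. So far I have been able to strip the unwanted characters and whitespace, but I am struggling to remove the duplicate words in my code. I tried using set, but the value I get back is null.

Here is my code.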

with open('words.txt', mode='r', encoding='utf8') as f:
    # open file and split each word
    for line in f:
        for word in line.split():
            # remove unwanted characters
            for char in ['।', ',', '’', '‘', '?']:
                if char in word:
                    word = word.replace(char, '')
            # remove blank line
            if word.strip():
                print(word)

4 answers:

Answer 0 (score: 1)

In this answer I define clean as a no-op function. You will probably want to strip punctuation and so on, so you will have to define clean accordingly:

def clean(w): return w

With the help of the clean() function, you can use a double list comprehension (technically it is a generator expression) to collect the unique words of your text in a set:

suw = set(clean(w) for line in open('words.txt') for w in line.split())

Finally, you can remove the empty string from the set:

suw.discard('')

To iterate over the members of the set (the unique words), use the familiar for ... in ...: construct.
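The set-based approach in this answer can be sketched end to end as follows. This is only a minimal illustration: the body of clean() and the helper name unique_words are assumptions, not part of the answer, and the list of characters to strip is copied from the question's code.

```python
# Characters to strip, copied from the question's code.
UNWANTED = ['।', ',', '’', '‘', '?']


def clean(word):
    """One possible clean(): remove the unwanted characters from a word."""
    for char in UNWANTED:
        word = word.replace(char, '')
    return word


def unique_words(lines):
    """Collect the distinct cleaned words from an iterable of lines.

    unique_words is a hypothetical helper name used for this sketch.
    """
    words = set(clean(w) for line in lines for w in line.split())
    # A token made only of unwanted characters cleans to '', so drop it.
    words.discard('')
    return words


# Any iterable of lines works; an open file object could be passed instead.
sample = ['one, two? two', '‘one’ three ।']
for word in sorted(unique_words(sample)):
    print(word)
```

To run this on the file from the question, open it with encoding='utf8' inside a with statement and pass the file object to unique_words.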

Answer 1 (score: 0)

I think the code below is fairly self-explanatory.

with open("words.txt", 'r', encoding="utf-8") as f:
    for line in f:
        if line.strip():
            words = []
            duplicates = set()

            for word in line.split():
                word = word.strip()
                if word:
                    for i in ['।', ',', '’', '‘', '?']:
                        word = word.replace(i, "")  # Doesn't create an error if i isn't in the word.

                    if word in duplicates:
                        pass  # do nothing
                    elif word in words:
                        words.remove(word)
                        duplicates.add(word)
                    else:
                        words.append(word)

            print(" ".join(words))  # or just `print " ".join(words)` for python2

Answer 2 (score: 0)

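# read the whole file; the with block closes it afterwards
with open('words.txt', mode='r', encoding='utf8') as f:
    text = f.read()
# remove unwanted characters
for char in ['।', ',', '’', '‘', '?']:
    text = text.replace(char, '')
# a set keeps one copy of each word (order is not preserved)
list_of_words = list(set(text.split()))
print(list_of_words)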
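Note that converting to a set and back to a list does not keep the words in the order they appear in the text. If the order matters, one possible sketch is to remember the words already seen in a set while appending first occurrences to a list. The helper name unique_in_order is hypothetical and not part of this answer.

```python
def unique_in_order(words):
    """Return the words with duplicates removed, keeping first-seen order."""
    seen = set()
    result = []
    for word in words:
        if word not in seen:
            seen.add(word)       # remember the word so later copies are skipped
            result.append(word)  # keep only the first occurrence
    return result


text = 'b a b c a'
print(unique_in_order(text.split()))
```

The set gives fast membership checks, while the list records the order of first appearance.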

Answer 3 (score: 0)

I assume you want to write the words back to the same text file. Here is the code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

clean_list = set()
with open('words.txt', mode='r') as f:
    # open file and split each word
    for line in f:

        for word in line.split():

            # remove unwanted characters
            for char in ['।', ',', '’', '‘', '?']:
                if char in word:
                    print(char)
                    word = word.replace(char, '')

            # remove blank line
            if word.strip():
                clean_list.add(word)

# open the file again
with open('words.txt', mode='w+') as f:
    # clean file
    f.truncate()
    # writing the words
    for clean_word in clean_list:
        f.write(clean_word + '\n')

If you don't want to write to the same file, just change the last lines to:

# open a new file instead
with open('new_words.txt', mode='w+') as f:
    # writing the words
    for clean_word in clean_list:
        f.write(clean_word + '\n')  # \n to save each word in a new line

Edit

I just used set, as @Copperfield suggested.