Question

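My goal is to remove duplicate words from the paragraphs of a (Unicode) text file and produce a list of words. So far I have been able to remove the unwanted characters and whitespace, but I am struggling to remove duplicate words in my code. I tried using set, but the value I get back is null.

Here is my code.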

with open('words.txt', mode='r', encoding='utf8') as f:
    # open file and split each word
    for line in f:
        for word in line.split():
            # remove unwanted characters
            for char in ['।', ',', '’', '‘', '?']:
                if char in word:
                    word = word.replace(char, '')

            # remove blank line
            if word.strip():
                print(word)

Answer 1

In this answer I define clean() as a no-op function; you will probably want to remove punctuation and so on, so you have to define clean accordingly

def clean(w): return w

With the help of the clean() function, you can use a double list comprehension (technically it is more of a generator expression) to collect the unique words of the text in a set

suw = set(clean(w) for line in open('words.txt') for w in line.split())

Finally, you can remove the empty string from the set

suw.discard('')

To iterate over the members of the set (the unique words), use the familiar for ... in ...: construct
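
As one possible way to fill in clean, here is a minimal sketch that strips the punctuation listed in the question with str.translate. The names UNWANTED and unique_words are made up for illustration, and the helper takes any iterable of lines instead of opening words.txt itself, so this is not the answerer's exact code.

```python
# Characters the question wants removed (copied from the question's list).
UNWANTED = str.maketrans('', '', '।,’‘?')


def clean(w):
    """Strip the unwanted punctuation from a single word."""
    return w.translate(UNWANTED)


def unique_words(lines):
    """Collect the unique cleaned words from an iterable of lines (hypothetical helper)."""
    suw = set(clean(w) for line in lines for w in line.split())
    suw.discard('')  # a token made only of punctuation cleans to ''
    return suw


# In-memory sample standing in for the lines of words.txt.
sample = ['one, two। one?', '‘two’ three ?']
for word in sorted(unique_words(sample)):
    print(word)
```

An open file object is also an iterable of lines, so passing the f from a with open('words.txt', encoding='utf8') as f: block to unique_words should work the same way.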

Answer 2

I think the code below is pretty self-explanatory.

with open("words.txt", 'r', encoding="utf-8") as f:
    for line in f:
        if line.strip():
            words = []
            duplicates = set()

            for word in line.split():
                word = word.strip()
                if word:
                    for i in ['।', ',', '’', '‘', '?']:
                        word = word.replace(i, "")  # Doesn't create an error if i isn't in the word.

                    if word in duplicates:
                        pass  # do nothing
                    elif word in words:
                        words.remove(word)
                        duplicates.add(word)
                    else:
                        words.append(word)

            print(" ".join(words))  # or just `print " ".join(words)` for python2
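
Note that the loop above removes a word from the line altogether once it shows up a second time. If the aim is to keep the first occurrence of each word, in order of appearance, a sketch along these lines might be closer; dedupe_line is a hypothetical name, and it relies on dicts preserving insertion order (guaranteed from Python 3.7).

```python
def dedupe_line(line, unwanted=('।', ',', '’', '‘', '?')):
    """Return the cleaned words of `line`, keeping only the first occurrence of each."""
    cleaned = []
    for word in line.split():
        for ch in unwanted:
            word = word.replace(ch, '')
        if word:  # skip tokens that were pure punctuation
            cleaned.append(word)
    # dict keys are unique and keep insertion order, so this drops later repeats
    return list(dict.fromkeys(cleaned))


print(' '.join(dedupe_line('a, b a? c b')))
```

Calling dedupe_line on each line of the file and joining the result would mirror the per-line printing used in this answer.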

Answer 3

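with open('words.txt', mode='r', encoding='utf8') as f:
    text = f.read()
# strip the unwanted characters from the whole text at once
for char in ['।', ',', '’', '‘', '?']:
    text = text.replace(char, '')
# a set keeps one copy of each word; the original order is not preserved
list_of_words = list(set(text.split()))
print(list_of_words)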
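
Because a set has no defined order, the list printed above can come out in a different order from one run to the next. If a stable result matters, one option is to sort the unique words; the sketch below assumes plain code-point ordering is acceptable, and word_list is a hypothetical helper name.

```python
def word_list(text, unwanted='।,’‘?'):
    """Return the unique words of `text`, sorted, with unwanted characters removed."""
    for char in unwanted:
        text = text.replace(char, '')
    # set() drops duplicates, sorted() gives a reproducible order
    return sorted(set(text.split()))


print(word_list('b, a? b c‘ a'))
```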

Answer 4

I assume you want to write these words back into the same text file; here is the code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

clean_list = set()
with open('words.txt', mode='r', encoding='utf8') as f:
    # open file and split each word
    for line in f:
        for word in line.split():
            # remove unwanted characters
            for char in ['।', ',', '’', '‘', '?']:
                if char in word:
                    word = word.replace(char, '')

            # skip words that are empty after cleaning
            if word.strip():
                clean_list.add(word)

# open the file again; 'w' mode truncates it before writing
with open('words.txt', mode='w', encoding='utf8') as f:
    # writing the words
    for clean_word in clean_list:
        f.write(clean_word + '\n')

If you don't want to write to the same file, just change the last part to:

# open a new file instead
with open('new_words.txt', mode='w', encoding='utf8') as f:
    # writing the words
    for clean_word in clean_list:
        f.write(clean_word + '\n')  # \n to save each word on a new line

Edit

I simply used a set, as @Copperfield suggested

Question: Removing duplicates from a text file

4 answers: