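My goal is to remove duplicate words from a paragraph in a (Unicode) text file and produce a word list. So far I have been able to remove the unwanted characters and whitespace, but I am struggling to remove the duplicate words in code. I tried using set, but the value I get back is null.
Here is my code.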
    with open('words.txt', mode='r', encoding='utf8') as f:
        # open file and split each word
        for line in f:
            for word in line.split():
                # remove unwanted characters
                for char in ['।', ',', '’', '‘', '?']:
                    if char in word:
                        word = word.replace(char, '')
                # remove blank line
                if word.strip():
                    print(word)
Answer 0 (score: 1)
In this answer I will assume that clean is defined as a no-op function. You probably want to remove punctuation and so on, so you will have to define clean accordingly.

    def clean(w): return w

With the help of the clean() function, you can use a double list comprehension (technically it is more of a generator expression) to collect the unique words of the text in a set.

    suw = set(clean(w) for line in open('words.txt') for w in line.split())

Finally, you can remove the empty string from the set.

    suw.discard('')

To iterate over the members of the set (the unique words), use the familiar for ... in ..: construct.
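To make the pieces above concrete, here is a minimal sketch that fills in clean() with one possible definition. The character list is copied from the question; the names UNWANTED, _STRIP_TABLE and unique_words are hypothetical and only used for illustration. The function takes any iterable of lines, so an open file object works as well as the list used here.

```python
# Characters to strip, copied from the question's list. What counts as
# "unwanted" is an assumption; adjust it to your own text.
UNWANTED = '।,’‘?'
_STRIP_TABLE = str.maketrans('', '', UNWANTED)


def clean(w):
    """Return w with every unwanted character removed."""
    return w.translate(_STRIP_TABLE)


def unique_words(lines):
    """Collect the distinct cleaned words from an iterable of lines."""
    suw = set(clean(w) for line in lines for w in line.split())
    suw.discard('')  # a token made only of punctuation cleans to ''
    return suw


sample = ['the cat sat, the cat?', '‘the’ mat। ?']
for word in sorted(unique_words(sample)):
    print(word)
```

With a real file you would pass the file object instead of sample, for example inside `with open('words.txt', encoding='utf8') as f:`.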
Answer 1 (score: 0)
I think the code below is fairly self-explanatory.
    with open("words.txt", 'r', encoding="utf-8") as f:
        for line in f:
            if line.strip():
                words = []
                duplicates = set()
                for word in line.split():
                    word = word.strip()
                    if word:
                        for i in ['|', ',', '’', '‘', '?']:
                            word = word.replace(i, "")  # Doesn't create an error if i isn't in the word.
                        if word in duplicates:
                            pass  # do nothing
                        elif word in words:
                            words.remove(word)
                            duplicates.add(word)
                        else:
                            words.append(word)
                print(" ".join(words))  # or just `print " ".join(words)` for python2
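Note what this code does with a repeated word: it drops every copy of it from that line, so only words that occur exactly once per line survive. The sketch below restates that behavior for a single line using collections.Counter. The function name words_seen_once is hypothetical, and unlike the code above it also skips tokens that become empty after cleaning.

```python
from collections import Counter

# Same characters as in the answer's loop.
UNWANTED = ['|', ',', '’', '‘', '?']


def words_seen_once(line):
    """Return the words of line that occur exactly once, in original order."""
    cleaned = []
    for word in line.split():
        for char in UNWANTED:
            word = word.replace(char, '')
        if word:  # skip tokens that were only punctuation
            cleaned.append(word)
    counts = Counter(cleaned)
    return [w for w in cleaned if counts[w] == 1]


print(" ".join(words_seen_once('a b a c, b d?')))
```

If you would rather keep one copy of each repeated word, collecting the words in a set, as the other answers do, is the simpler route.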
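Answer 2 (score: 0)
    # read the whole file; the with block closes it afterwards
    with open('words.txt', mode='r', encoding='utf8') as f:
        text = f.read()
    # remove unwanted characters
    for char in ['।', ',', '’', '‘', '?']:
        text = text.replace(char, '')
    # a set keeps one copy of each word (in no particular order)
    list_of_words = list(set(text.split()))
    print(list_of_words)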
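One thing to be aware of: a set does not remember the order in which the words appeared. If the order matters, a possible variant is to deduplicate with dict.fromkeys, since dictionaries keep insertion order in Python 3.7 and later. This is only a sketch; the function name unique_in_order and its unwanted parameter are hypothetical.

```python
def unique_in_order(text, unwanted=('।', ',', '’', '‘', '?')):
    """Return the distinct words of text in order of first appearance."""
    for char in unwanted:
        text = text.replace(char, '')
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(text.split()))


print(unique_in_order('b a b, c a?'))
```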
Answer 3 (score: 0)
I assume you want to write the words back into the same text file. Here is the code:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    clean_list = set()
    with open('words.txt', mode='r', encoding='utf8') as f:
        # open file and split each word
        for line in f:
            for word in line.split():
                # remove unwanted characters
                for char in ['।', ',', '’', '‘', '?']:
                    if char in word:
                        word = word.replace(char, '')
                # remove blank line
                if word.strip():
                    clean_list.add(word)
    # open the file again
    with open('words.txt', mode='w+', encoding='utf8') as f:
        # clean file
        f.truncate()
        # writing the words
        for clean_word in clean_list:
            f.write(clean_word + '\n')
If you do not want to write to the same file, just change the last block to:

    # open a new file for the output
    with open('new_words.txt', mode='w+', encoding='utf8') as f:
        # writing the words
        for clean_word in clean_list:
            f.write(clean_word + '\n')  # \n to save each word in a new line
Edit
I just used set, as @Copperfield suggested.