Adding the unique words from a text file to a list in Python

Time: 2016-06-16 10:52:01

Tags: python python-2.7

Suppose I have the following text file:

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

I want to add all the unique words in this file to a list:

fname = open("romeo.txt")
lst = list()
for line in fname:
    line = line.rstrip()
    words = line.split(' ')
    for word in words:
        if word in lst: continue
        lst = lst + words
    lst.sort()
print lst

But the program's output is as follows:

['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 
'and', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 
'is', 'is', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 
'soft', 'sun', 'sun', 'the', 'the', 'the', 'through', 'what', 
'window', 'with', 'yonder']

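'and' and a few other words appear multiple times in the list. Which part of the loop should I change so that I don't get any duplicate words? Thanks!

8 answers:

Answer 0 (score: 6)

Here are the problems with your code, annotated in place; a corrected version follows further down: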

fname = open("romeo.txt")      # better to open files in a `with` statement
lst = list()                   # lst = [] is more Pythonic
for line in fname:
    line = line.rstrip()       # not required, `split()` will do this anyway
    words = line.split(' ')    # don't specify a delimiter, `line.split()` will split on all white space
    for word in words:
        if word in lst: continue
        lst = lst + words      # this is the reason that you end up with duplicates... words is the list of all words for this line!
    lst.sort()                 # don't sort in the for loop, just once afterwards.
print lst

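So it almost works; however, you should append only the current word to the list, not all of the words that split() returned for the line. If you simply change the line:

lst = lst + words

to:

lst.append(word)

it will work.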
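To make the difference concrete, here is a minimal sketch that runs both variants on two lines of the sample text held in memory instead of reading romeo.txt. The names SAMPLE_LINES, collect_buggy and collect_fixed are made up for this illustration and do not appear in the question.

```python
# Hypothetical helpers for illustration only; the data is a slice of the sample file.
SAMPLE_LINES = [
    "It is the east and Juliet is the sun",
    "Arise fair sun and kill the envious moon",
]

def collect_buggy(lines):
    """Mimic the question's loop: the whole line is added, not the single word."""
    lst = []
    for line in lines:
        words = line.split()
        for word in words:
            if word in lst:
                continue
            lst = lst + words  # adds every word on the line, repeats included
    return sorted(lst)

def collect_fixed(lines):
    """Append only the current word, and only if it has not been seen yet."""
    lst = []
    for line in lines:
        for word in line.split():
            if word not in lst:
                lst.append(word)
    return sorted(lst)

buggy = collect_buggy(SAMPLE_LINES)
fixed = collect_fixed(SAMPLE_LINES)
print(buggy.count("is"), fixed.count("is"))  # prints: 2 1
```

In the buggy variant the first unseen word on a line pulls in the entire line, so any word repeated within that line, or already present from an earlier line, ends up in the list more than once.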

Here is the corrected version:

with open("romeo.txt") as infile:
    lst = []
    for line in infile:
        words = line.split()
        for word in words:
            if word not in lst:
                lst.append(word)    # append only this word to the list, not all words on this line
    lst.sort()
    print(lst)

As others have suggested, a set is a good way to solve this problem. It's as simple as:

with open('romeo.txt') as infile:
    print(sorted(set(infile.read().split())))

With sorted() you don't need to keep a reference to the list. If you do want to use the sorted list elsewhere, do this:

with open('romeo.txt') as infile:
    unique_words = sorted(set(infile.read().split()))
    print(unique_words)

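Reading the whole file into memory may not be suitable for large files. You can use a generator to read the file efficiently without cluttering the main code. This generator reads the file one line at a time and yields one word at a time. It never reads the whole file at once, unless the file consists of a single long line (which your sample data clearly does not):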

def get_words(f):
    for line in f:
        for word in line.split():
            yield word

with open('romeo.txt') as infile:
    unique_words = sorted(set(get_words(infile)))

Answer 1 (score: 4)

This is much easier to do in Python using a set:

with open("romeo.txt") as f:
    unique_words = set(f.read().split())

If you want a list, convert it afterwards:

unique_words = list(unique_words)

It might also be nice to have them in alphabetical order:

unique_words.sort()

Answer 2 (score: 2)

There are several ways to achieve what you want.

1) Using a list:

fname = open("romeo.txt")
lst = list()
for word in fname.read().split(): # This splits on all whitespace, i.e. on both ' ' and '\n'
    if word not in lst:
        lst.append(word)
lst.sort()
print lst

2) Using a set:

fname = open("romeo.txt")
lst = list(set(fname.read().split()))
lst.sort()
print lst

A set simply ignores duplicates, so the membership check is unnecessary.
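As a small sketch of that behaviour, the snippet below adds every word of one sample line to a set without checking first. It is written with print() so it runs under Python 3 as well; the variable names are illustrative only.

```python
# One line from the sample file, held in memory for the demonstration.
words = "It is the east and Juliet is the sun".split()

unique = set()
for word in words:
    unique.add(word)  # adding a word that is already present changes nothing

print(len(words), len(unique))  # prints: 9 7
print(sorted(unique))
# prints: ['It', 'Juliet', 'and', 'east', 'is', 'sun', 'the']
```

Note that the sort is case-sensitive, so capitalised words come before lowercase ones, as in the output shown in the question.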

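Answer 3 (score: 1)

If you want a collection of unique words, you are better off using a set than a list, because the test `in lst` scans the whole list each time and can be very inefficient.

For counting words, it is better to use a Counter object (collections.Counter).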
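A minimal sketch of the Counter approach, assuming the sample text is held in a string rather than read from romeo.txt (the variable names are illustrative):

```python
from collections import Counter

# The sample text from the question, inlined so the example is self-contained.
text = """But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief"""

counts = Counter(text.split())  # maps each word to its number of occurrences

print(counts["and"])  # prints: 3
print(counts["sun"])  # prints: 2

# The keys of the Counter are the unique words.
unique_words = sorted(counts)
print(len(unique_words))  # prints: 26
```

A Counter gives you the unique words and their frequencies in one pass, so it is a reasonable choice when you need both.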

Answer 4 (score: 1)

I would do:

with open('romeo.txt') as fname:
    text = fname.read()
    lst = list(set(text.split()))
    print lst


>> ['and', 'envious', 'already', 'fair', 'is', 'through', 'pale', 'yonder', 'what', 'sun', 'Who', 'But', 'moon', 'window', 'sick', 'east', 'breaks', 'grief', 'with', 'light', 'It', 'Arise', 'kill', 'the', 'soft', 'Juliet']

Answer 5 (score: 0)

Use word instead of words (this also simplifies the loop):

fname = open("romeo.txt")
lst = list()
for line in fname:
    line = line.rstrip()
    words = line.split(' ')
    for word in words:
        if word not in lst:
            lst.append(word)
    lst.sort()
print lst

Or use [word] with the + operator:

fname = open("romeo.txt")
lst = list()
for line in fname:
    line = line.rstrip()
    words = line.split(' ')
    for word in words:
        if word in lst: continue
        lst = lst + [word]
    lst.sort()
print lst

Answer 6 (score: 0)

import string

# Python 3: build the table that deletes all punctuation once, not once per word
strip_punctuation = str.maketrans('', '', string.punctuation)

# Open both files in one `with` so the output file is closed as well
with open("romeo.txt") as infile, open('romeo_unique.txt', 'w') as uniquewords:
    lst = []
    for line in infile:
        for word in line.split():  # loops through all words
            word = word.translate(strip_punctuation).lower()
            if word not in lst:
                lst.append(word)                # append only this unique word to the list
                uniquewords.write(word + '\n')  # write the unique word to the file

Answer 7 (score: -1)

You need to change

lst = lst + words

to lst.append(word). If you want unique words, you need to append word to the list, not words (which is the list of all the words on the line).