Question

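Whenever I try to run this program, Python IDLE tells me it is not responding and has to be closed. Any suggestions on how to improve this code so that it works the way I want?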

#open text document
#filter out words in the document by appending to an empty list
#get rid of words that show up more than once
#get rid of words that aren't all lowercase
#get rid of words that end in substring 'xx'
#get rid of words that are less than 5 characters
#print list

fin = open('example.txt')
L = []
for word in fin:
    if len(word) >= 5:
        L.append(word)
    if word != word:
        L.append(word)
    if word[-2:-1] != 'xx':
        L.append(word)
    if word == word.lower():
        L.append(word)
print L

Answer 1

Some general help:

Instead of

fin = open('example.txt')

you should use

with open('example.txt', 'r') as fin:

and then indent the rest of the code, although your version will work.

L = []
for word in fin:

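This does not iterate word by word; it iterates line by line. If the file has one word per line, each word will still have a trailing newline, so you should use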

word = word.rstrip()

to strip any whitespace after the end of the word. If you really do want to process one word at a time, you need two for loops, for example:

for line in fin:
    for word in line.split():

and then put your logic inside the inner loop.
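To make the difference between iterating by line and by word concrete, here is a small sketch written for Python 3 (the code in this thread is Python 2). It assumes `io.StringIO` as a stand-in for the opened file, and the sample text is made up for illustration.

```python
import io

# Hypothetical file contents; io.StringIO stands in for open('example.txt').
fin = io.StringIO("hello world\nsuffixx Python\n")

# Iterating over a file object yields whole lines, newline included.
lines = list(fin)

# Splitting each line on whitespace yields individual words without newlines.
fin.seek(0)
words = [word for line in fin for word in line.split()]

print(lines)
print(words)
```

The first list still carries the `\n` at the end of each entry, which is why tests such as `len(word)` or a check on the last two characters can give surprising results on unstripped lines.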

if len(word) >= 5:
    L.append(word)

With the whitespace stripped, this adds any word of five or more characters to the list.

if word != word:
    L.append(word)

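word is always equal to word, so this does nothing. If you want to eliminate duplicates, make L a set() and use L.add(word) instead of L.append(word) to add words (assuming order does not matter).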
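As a rough illustration of the set suggestion, the Python 3 sketch below removes duplicates from a made-up word list. The `dict.fromkeys` variant is an addition not mentioned in the answer; it may be useful if order does matter.

```python
words = ["bubble", "chime", "bubble", "orbital", "chime"]  # made-up sample

# A set discards repeats but does not promise any particular order.
unique_unordered = set()
for word in words:
    unique_unordered.add(word)

# dict.fromkeys keeps the first occurrence of each word in original order
# (dicts preserve insertion order in Python 3.7 and later).
unique_ordered = list(dict.fromkeys(words))

print(unique_ordered)
```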

if word[-2:-1] != 'xx':
    L.append(word)

If you want to check whether it ends in 'xx', use

if not word.endswith('xx'):

instead, or word[-2:] without the -1; otherwise you are comparing only the second-to-last letter rather than the whole ending.
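The slicing difference is easy to check interactively. A short Python 3 sketch, using an invented sample word:

```python
word = "suffixx"  # made-up sample word

second_to_last = word[-2:-1]  # stop index -1 is exclusive, so a single character
last_two = word[-2:]          # from the second-to-last character through the end

print(repr(second_to_last))
print(repr(last_two))
print(word.endswith("xx"))
```

Because `word[-2:-1]` is only ever one character long, comparing it with the two-character string 'xx' can never be equal, so the original `!=` test passes for every word.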

if word == word.lower():
    L.append(word)

This adds the word to the list if it is all lowercase.

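Note that all of these if tests are applied to every word, so the word is added to the list once for each test it passes. If you only want to add it once, you can use elif instead of if for every test except the first.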

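Your comments also imply that you are "getting rid of" words by adding them to the list, but you are not. You keep the ones you add to the list and the rest simply go away; you are not changing the file in any way.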
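Putting these points together, one possible reading of the question's comments is that a word should be kept only if it passes every test. The Python 3 sketch below takes that reading, so it joins the tests with `and` rather than using separate `if` or `elif` branches. The function name `filter_words` and the sample text are hypothetical, and `io.StringIO` stands in for the file.

```python
import io

def filter_words(lines):
    """Return the set of words that pass every test."""
    kept = set()  # a set stores each word at most once
    for line in lines:
        for word in line.split():
            if (len(word) >= 5
                    and word == word.lower()
                    and not word.endswith("xx")):
                kept.add(word)
    return kept

# Made-up sample standing in for open('example.txt').
sample = io.StringIO("bubble chime Spain suxx moon\nbubble glomoxx orbital\n")
result = filter_words(sample)
print(sorted(result))
```

Note that this keeps one copy of a repeated word; it does not discard words that occur more than once.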

Answer 2

import re

def by_words(it):
    pat = re.compile('\w+')
    for line in it:
        for word in pat.findall(line):
            yield word

def keepers(it):
     words = set()
     for s in it:
         if len(s)>=5 and s==s.lower() and not s.endswith('xx'):
             words.add(s)
     return list(words)

To get 5 words from War and Peace:

from urllib import urlopen
source = urlopen('http://www.gutenberg.org/ebooks/2600.txt.utf8')
print keepers(by_words(source))[:5]

prints

['raining', 'divinely', 'hordes', 'nunnery', 'parallelogram']

This does not need much memory. War and Peace has only 14,361 words that meet your criteria. The iterators work on very small chunks.
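To illustrate why the generator works on small chunks, here is a Python 3 sketch of `by_words` fed from a hypothetical line source that records which lines have been read. The `source` function, the `consumed` list, and the sample lines are inventions for this demonstration.

```python
import itertools
import re

WORD = re.compile(r"\w+")

def by_words(lines):
    """Yield words one at a time; only the current line is held in memory."""
    for line in lines:
        for word in WORD.findall(line):
            yield word

consumed = []  # records which lines the generator has actually pulled

def source():
    """Hypothetical line source standing in for a file or URL response."""
    for line in ["alpha beta", "gamma delta", "epsilon zeta"]:
        consumed.append(line)
        yield line

# Ask for only the first three words.
first_three = list(itertools.islice(by_words(source()), 3))
print(first_three)
print(consumed)
```

Only the first two lines are read to produce three words; the third line is never touched.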

Answer 3

I did your homework for you; I was bored. There may be a bug.

homework_a_plus = []
#open text document
with open('example.txt', 'r') as fin:
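    for word in fin:
        #strip the trailing newline so the tests below see the bare word
        word = word.strip()
        #get rid of words that show up more than once
        if word in homework_a_plus:
            continue
        #get rid of words that aren't all lowercase
        #(a continue inside "for c in word" would only skip to the next character)
        if any(c.isupper() for c in word):
            continue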
        #get rid of words that end in substring 'xx'
        if word[-2:] == 'xx':
            continue
        #get rid of words that are less than 5 characters
        if len(word) < 5:
            continue
        homework_a_plus.append(word)
print homework_a_plus

Edit: As Wooble said, the logic in the code you provided is off. Compare your code with mine and I think you will understand why your code has problems.

Answer 4

words = [inner for outer in [line.split() for line in open('example.txt')] for inner in outer]

for word in words[:]:
    if words.count(word) > 1 or word.lower() != word or word[-2:] == 'xx' or len(word) < 5:
        words.remove(word)
print words

Answer 5

If you want to write it more as a filter... I would take a slightly different approach.

fin = open('example.txt','r')
seenList = []
for line in fin:
    for word in line.split():
        if word in seenList: continue
        if word[-2:] == 'xx': continue
        if word.lower() != word: continue
        if len(word) < 5: continue
        seenList.append(word)
        print word

This prints each word on its own line. If you want to output to a file, modify the print word line accordingly, or use shell redirection.

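Edit: If you really do not want to print any word that is repeated (the above only skips every instance after the first), then something like this works...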

fin = open('example.txt','r')
seenList = []
for line in fin:
    for word in line.split():
        if word in seenList: 
            seenList.remove(word)
            continue
        if word[-2:] == 'xx': continue
        if word.lower() != word: continue
        if len(word) < 5: continue
        seenList.append(word)

print seenList
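One caveat with removing a word the second time it is seen: a third occurrence is no longer in the list, so it would be added again. If the goal is to drop every word that occurs more than once anywhere in the input, one alternative is to count first and filter afterwards. The Python 3 sketch below uses `collections.Counter`, a different technique from the answer's, on an invented sample string.

```python
from collections import Counter

# Made-up sample text standing in for the file contents.
text = "bubble chime bubble orbital bubble Spain suxx chime"
words = text.split()
counts = Counter(words)  # maps each word to its number of occurrences

unique = [w for w in words
          if counts[w] == 1          # appears exactly once
          and len(w) >= 5            # at least five characters
          and w == w.lower()         # all lowercase
          and not w.endswith("xx")]  # does not end in 'xx'
print(unique)
```

This needs two passes over the words, so unlike the streaming versions it holds the whole word list in memory.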

Answer 6

Easy with a regular expression:

import re

li = ['bubble', 'iridescent', 'approxx', 'chime',
      'Azerbaidjan', 'moon', 'astronomer', 'glue', 'bird',
      'plan_ary', 'suxx', 'moon', 'iridescent', 'magnitude',
      'Spain', 'through', 'macGregor', 'iridescent', 'ben',
      'glomoxx', 'iridescent', 'orbital']

reg1 = re.compile('(?!\S*?[A-Z_]\S*(?=\Z))'
                 '\w{5,}'
                 '(?<!xx)\Z')

print set(filter(reg1.match,li))

# result:

set(['orbital', 'astronomer', 'magnitude', 'through', 'iridescent', 'chime', 'bubble'])
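For readers who find the pattern dense, here is the same `reg1` written for Python 3 with `re.VERBOSE`, so each part can carry a comment. The comments are one interpretation of what each piece does. The inner `(?=\Z)` lookahead is written as a plain `\Z`, which should behave the same at that position.

```python
import re

li = ['bubble', 'iridescent', 'approxx', 'chime',
      'Azerbaidjan', 'moon', 'astronomer', 'glue', 'bird',
      'plan_ary', 'suxx', 'moon', 'iridescent', 'magnitude',
      'Spain', 'through', 'macGregor', 'iridescent', 'ben',
      'glomoxx', 'iridescent', 'orbital']

reg1 = re.compile(r"""
    (?!\S*?[A-Z_]\S*\Z)   # reject if an uppercase letter or underscore appears
    \w{5,}                # at least five word characters
    (?<!xx)\Z             # must reach the end of the string, not preceded by xx
""", re.VERBOSE)

matches = sorted(set(filter(reg1.match, li)))
print(matches)
```

In Python 3, `filter` returns an iterator and `print` is a function, which is why the call differs slightly from the Python 2 line above.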

If the data is not in a list but in a string:

ss = '''bubble iridescent approxx chime
Azerbaidjan moon astronomer glue bird
plan_ary suxx moon iridescent magnitude
Spain through macGregor iridescent ben
glomoxx iridescent orbital'''

print set(filter(reg1.match,ss.split()))

or

reg2 = re.compile('(?:(?<=\s)|(?<=\A))'
                 '(?!\S*?[A-Z_]\S*(?=\s|\Z))'
                 '\w{5,}'
                 '(?<!xx)'
                 '(?=\s|\Z)')

print set(reg2.findall(ss))

Python: creating a filtered list from a text document
