Question

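I'm new to Python, and I'm trying to find the longest word in alice_in_wonderland.txt. I think I have a good system set up ("see below"), but my output is returning a "word" made of several words joined by dashes. Is there a way to remove the dashes from the file input? For the text file, visit here.

Sample from the text file:

'That's very important,' the King said, turning to the jury. They were just beginning to write this down on their slates, when the White Rabbit interrupted: 'Unimportant, your Majesty means, of course,' he said in a very respectful tone, but frowning and making faces at him as he spoke. 'Unimportant, of course, I meant,' the King hastily said, and went on to himself in an undertone, 'important--unimportant--unimportant--important--' as if he were trying which word sounded best.

Code: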

    #String input
    with open("alice_in_wonderland.txt", "r") as myfile:
        string=myfile.read().replace('\n','')
    #initialize list
    my_list = []
    #Split words into list
    for word in string.split(' '):
        my_list.append(word)
    #initialize list
    uniqueWords = []
    #Fill in new list with unique words to shorten final printout
    for i in my_list:
        if not i in uniqueWords:
            uniqueWords.append(i)
    #Length of longest word
    count = 0
    #Longest word place holder
    longest = []
    for word in uniqueWords:
        if len(word)>count:
            longest = word
            count = len(longest)
        print longest

Answer 1

>>> import nltk # pip install nltk
>>> nltk.download('gutenberg')
>>> words = nltk.corpus.gutenberg.words('carroll-alice.txt')
>>> max(words, key=len) # find the longest word
'disappointment'
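
Note that `max` with `key=len` returns only the first word of maximal length. If several words tie, something like the following sketch would collect all of them. It is written in Python 3 syntax, and `longest_words` and the sample list are hypothetical names used for illustration, not part of nltk:

```python
def longest_words(words):
    """Return all distinct words sharing the maximum length, in first-seen order."""
    words = list(words)
    if not words:
        return []
    max_len = max(len(w) for w in words)
    seen = set()
    result = []
    for w in words:
        if len(w) == max_len and w not in seen:
            seen.add(w)
            result.append(w)
    return result

# Hypothetical sample; in practice the nltk word list could be passed instead.
sample = ["important", "unimportant", "King", "interrupted", "unimportant"]
print(longest_words(sample))  # ['unimportant', 'interrupted']
```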

Answer 2

Here's one way to do it using re and mmap:

import re
import mmap

with open('your alice in wonderland file') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    words = re.finditer('\w+', mf)
    print max((word.group() for word in words), key=len)

# disappointment

This is more efficient than loading the whole file into physical memory.
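
The snippet above uses the Python 2 print statement. A rough Python 3 sketch of the same re + mmap idea is below. It makes a few assumptions that are not in the answer: the function name `longest_word_mmap` is made up, the pattern `[A-Za-z]+` is used instead of `\w+` so that digits and underscores do not count as word characters, and it calls `search` in a loop rather than `finditer` so that nothing is still iterating over the map when the `with` block closes it.

```python
import mmap
import re

# A bytes pattern is needed because the mmap exposes bytes, not str.
WORD = re.compile(rb"[A-Za-z]+")

def longest_word_mmap(path):
    """Return the longest run of ASCII letters in the file at `path`."""
    longest = b""
    with open(path, "rb") as fin:
        # Length 0 maps the whole file; mmap raises ValueError for an empty file.
        with mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ) as mf:
            pos = 0
            while True:
                match = WORD.search(mf, pos)
                if match is None:
                    break
                if len(match.group()) > len(longest):
                    longest = match.group()
                # Matches are never empty, so pos always advances.
                pos = match.end()
    return longest.decode("ascii")

# Usage (assumes the file exists in the working directory):
# print(longest_word_mmap("alice_in_wonderland.txt"))
```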

Answer 3

Use str.replace to replace the dashes with spaces (or whatever you want). To do so, just tack another call to replace onto the first call on line 3:

string=myfile.read().replace('\n','').replace('-', ' ')
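
To put the chained replace in context, here is a minimal end-to-end sketch in Python 3 syntax. The function name `longest_word` and the sample string are illustrative assumptions. It also departs from the line above in one respect: instead of deleting `'\n'`, which glues the last word of a line to the first word of the next, it lets `split()` treat newlines as separators, and it strips surrounding punctuation from each word.

```python
import string

def longest_word(text):
    """Return the longest word in `text`, treating dashes as word separators."""
    # Replace dashes with spaces so "important--unimportant" becomes two words.
    cleaned = text.replace("-", " ")
    # split() with no argument splits on any run of whitespace, newlines included.
    words = (w.strip(string.punctuation) for w in cleaned.split())
    return max(words, key=len, default="")

sample = "went on to himself in an undertone,\n'important--unimportant--\nunimportant--important--' as if"
print(longest_word(sample))  # unimportant
```

For the real file, the text read inside the `with open(...)` block could be passed to the function directly.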

