I'm trying to count how many times words appear in a .txt file using Python 3

Time: 2016-04-06 21:15:38

Tags: python string python-3.x

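I'm trying to count the number of times each word appears in a txt file. The program seems to work, but I can't stop it from counting what I think is whitespace (the 60 in my results, which makes no sense, since there are more than 60 spaces). Is there a way to strip - and -- from the middle of words?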

import string

words = {}

def unique_words2(filename):
    strip = string.whitespace + string.punctuation + string.digits + "\"'"
    for line in open(filename):
        for word in line.lower().split():
            if word == " ":
                continue
            else:
                word = word.strip(strip)
                words[word] = words.get(word, 0) + 1
    for word in sorted(words):
        print("{0} {1}".format(word, words[word]))

unique_words2("alice.txt")

The first 5 results are:

 60
a 627
a--i'm 1
a-piece 1
abide 1

I would like to eliminate results like 1, 3 and 4.

2 Answers:

Answer 0: (score: 0)

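The strip method of Python strings only removes the specified characters from the beginning and end of the string, which is why you get outputs 3 and 4. Using the translate method instead solves that. Output 1 is caused by a different problem: when a word consists entirely of characters in strip, stripping leaves nothing, and it is counted in the words dictionary under the empty string.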
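To make the difference concrete, here is a minimal sketch using two tokens from the question's output. The variable names are only for illustration, and the character set is the one built in the question's code.

```python
import string

# Same set of unwanted characters as in the question's code.
strip_chars = string.whitespace + string.punctuation + string.digits

word = "a--i'm"

# strip() only trims the ends; this word starts and ends with a letter,
# so the dashes and apostrophe in the middle are untouched.
stripped = word.strip(strip_chars)

# translate() with a table mapping code points to None deletes
# those characters wherever they occur in the string.
table = {ord(ch): None for ch in strip_chars}
translated = word.translate(table)

# A token made up only of unwanted characters collapses to "",
# which is how the empty-string key ends up in the dictionary.
empty = "--".strip(strip_chars)

print(repr(stripped), repr(translated), repr(empty))
```

Note that translate joins the remaining letters together, so "a-piece" would be counted as "apiece" rather than as two words.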

Adjusted code:

import string

def unique_words2(filename):
    words = {}
    strip = string.whitespace + string.punctuation + string.digits + "\"'"
    # Map every unwanted character to None so translate() deletes it anywhere in the word
    translation = {ord(bad): None for bad in strip}
    with open(filename) as f:
        for line in f:
            for word in line.lower().split():
                word = word.translate(translation)
                # Skip tokens that consisted entirely of unwanted characters
                if word:
                    words[word] = words.get(word, 0) + 1
    for word in sorted(words):
        print("{0} {1}".format(word, words[word]))

unique_words2("alice.txt")

Answer 1: (score: 0)

From https://docs.python.org/2/library/string.html

string.split(s[, sep[, maxsplit]])

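The words are separated by arbitrary strings of whitespace characters (space, tab, newline, return, formfeed)

Replacing any other separators (such as '-') with spaces should do the trick. There is no need to handle repeated spaces, since a run of them is treated as a single separator.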

import string

def unique_words2(filename):
    words = {}
    strip = string.whitespace + string.punctuation + string.digits + "\"'"
    separators = '-_|'
    with open(filename) as f:
        for line in f:
            # Turn in-word separators into spaces so split() breaks on them
            for sep in separators:
                line = line.replace(sep, ' ')

            for word in line.lower().split():
                word = word.strip(strip)
                # Skip tokens that consisted entirely of stripped characters
                if word:
                    words[word] = words.get(word, 0) + 1
    for word in sorted(words):
        print("{0} {1}".format(word, words[word]))
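As a quick check of the idea that repeated spaces need no special handling, here is a small sketch that works on a string instead of a file. The helper name count_words and the sample text are made up for illustration; the separator set is the one used in this answer.

```python
import string

def count_words(text, separators="-_|"):
    """Count words in text, treating each separator character as a space."""
    strip_chars = string.whitespace + string.punctuation + string.digits
    counts = {}
    for sep in separators:
        text = text.replace(sep, " ")
    # split() with no argument treats any run of whitespace as one
    # delimiter, so the doubled spaces left by "--" produce no empty words.
    for word in text.lower().split():
        word = word.strip(strip_chars)
        if word:
            counts[word] = counts.get(word, 0) + 1
    return counts

sample = "A--I'm a-piece -- abide"
result = count_words(sample)
for word in sorted(result):
    print(word, result[word])
```

Unlike the translate approach, this counts "a-piece" as the two words "a" and "piece", and it keeps the apostrophe inside "i'm", because strip only trims the ends.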