Python program: foreign-language word frequency dictionary

Time: 2014-12-10 17:43:58

Tags: python dictionary

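I am trying to build a simple program that takes a text file and builds a dict() whose keys are the words and whose values are the number of times each word occurs (the word frequency).

I have learned that collections.Counter can do this easily (among other approaches). My problem is that I want the dictionary sorted by frequency so that I can print the Nth most frequent word. Finally, I also need a way to associate a different kind of value with each word later (a string holding the word's definition).

Basically, I need something that outputs this: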

Number of words: 5
[mostfrequentword: frequency, definition]
[2ndmostfrequentword: frequency, definition]
etc.   

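Here is what I have so far, but it only counts word frequencies; I don't know how to sort the dictionary by frequency and then print the Nth most frequent word: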

wordlist ={}

def cleanedup(string):
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    cleantext = ''
    for character in string.lower():
        if character in alphabet:
            cleantext += character
        else:
            cleantext += ' '
    return cleantext

def text_crunch(textfile):
    for line in textfile:
        for word in cleanedup(line).split():
            if word in wordlist:
                wordlist[word] += 1
            else:
                wordlist[word] = 1


with open('DQ.txt') as doc:
    text_crunch(doc)
    print(wordlist['todos'])

1 answer:

Answer 0 (score: 1)

A simpler version of your code that does almost everything you need :)

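import string
import collections

def cleanedup(fh):
    """Yield each run of ASCII letters found in the file as one word."""
    for line in fh:
        word = ''
        for character in line:
            if character in string.ascii_letters:
                word += character
            elif word:
                yield word
                word = ''
        if word:
            # Emit a word that runs to the end of a line with no trailing newline.
            yield word

with open('DQ.txt') as doc:
    wordlist = collections.Counter(cleanedup(doc))
    # most_common(5) returns the five most frequent (word, count) pairs.
    print(wordlist.most_common(5))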
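
To get at the Nth most frequent word specifically, one option is to index into the list that most_common() returns, since it is already sorted from most to least frequent. The following is only a sketch: the sample text stands in for the contents of DQ.txt, and nth_most_common is a hypothetical helper name, not part of collections.

```python
import collections

# Hypothetical sample text standing in for the contents of DQ.txt.
words = "de la de que la de el la de".split()
counts = collections.Counter(words)

# most_common() with no argument returns every (word, count) pair,
# sorted from the most frequent word to the least frequent one.
ranked = counts.most_common()


def nth_most_common(ranked_pairs, n):
    """Return the n-th most frequent (word, count) pair; n starts at 1."""
    return ranked_pairs[n - 1]


print(len(counts))                 # number of distinct words
print(nth_most_common(ranked, 2))  # second most frequent word and its count
```

Words with equal counts have no guaranteed ranking between them beyond the order in which Counter first saw them, so the "Nth" word is only well defined when the counts differ.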

An alternative solution using regular expressions:

import re
import collections

def cleanedup(fh):
    for line in fh:
        # findall returns every run of lowercase ASCII letters in the line.
        for word in re.findall('[a-z]+', line.lower()):
            yield word

with open('DQ.txt') as doc:
    wordlist = collections.Counter(cleanedup(doc))
    print(wordlist.most_common(5))

Or:

import re
import collections

def cleanedup(fh):
    for line in fh:
        for word in re.split('[^a-z]+', line.lower()):
            # split leaves empty strings at the edges of a line; skip them.
            if word:
                yield word

with open('DQ.txt') as doc:
    wordlist = collections.Counter(cleanedup(doc))
    print(wordlist.most_common(5))
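
For the definitions part of the question, one possible approach is to keep them in a separate dict keyed by word and join the two when printing. This is a rough sketch under several assumptions: build_report is a hypothetical helper, the sample words and definitions are made up, and "Number of words" is taken to mean the number of distinct words.

```python
import collections


def build_report(counts, definitions, n):
    """Return output lines for the n most frequent words."""
    lines = ["Number of words: %d" % len(counts)]
    for word, frequency in counts.most_common(n):
        # Words without a stored definition fall back to a placeholder.
        definition = definitions.get(word, "(no definition)")
        lines.append("[%s: %d, %s]" % (word, frequency, definition))
    return lines


# Hypothetical sample data standing in for DQ.txt and a real glossary.
counts = collections.Counter("de la de que la de el la de".split())
definitions = {"de": "of, from", "la": "the (feminine)"}

for line in build_report(counts, definitions, 3):
    print(line)
```

Keeping the definitions in their own dict leaves the Counter untouched, so most_common() keeps working and definitions can be added later without recounting.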