I have to store word counts for each word across multiple files. In Perl I used a hash of hashes, e.g. $wcCount{$file}{$word}. I can't figure out how to do anything similar in Python. I tried something to this effect, but it obviously doesn't work:
for line in fh:
    arr = line.split()
    for word in arr:
        key = filename + word  # creates a unique identifier for each word count
        freqdict[key] += 1     # fails with a KeyError the first time a key is seen
I read another StackOverflow answer to a similar question, but it didn't allow the value to be updated when a word was counted again. The input is the words of several files (passed as command-line arguments), and the output should simply be a frequency list of the words in each file.
Answer 0 (score: 2)
Suppose you have Hamlet and you want to count its unique words. You could do it like this:
# the tools we need: url reading and the regex library
import urllib2
import re

# a dict -- similar to a Perl hash
words = {}

# read the text at that url
response = urllib2.urlopen('http://pastebin.com/raw.php?i=7p3uycAz')
hamlet = response.read()

# split on whitespace, remove trailing punctuation, and count each unique word
for word in hamlet.split():
    word = re.sub(r'\W+$', r'', word)
    if word.strip():
        words[word] = words.setdefault(word, 0) + 1
Then, if you want to print the words from most common to least common:
# sort descending on count, ascending on ascii lower case
for word, count in sorted(words.items(), key=lambda t: (-t[1], t[0].lower())):
    print word, count
Prints:
the 988
and 702
of 628
to 610
I 541
you 495
a 452
my 441
in 399
HAMLET 385
it 360
is 313
...
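As an aside, collections.defaultdict gives the same counts without the setdefault call (and sidesteps the KeyError from the question's snippet). A minimal sketch, reusing hamlet and re from above:

from collections import defaultdict

# missing keys start at int() == 0, so += 1 just works
words = defaultdict(int)
for word in hamlet.split():
    word = re.sub(r'\W+$', r'', word)
    if word.strip():
        words[word] += 1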
If you want a dict of nested dicts (as your Perl example suggests), you might do:
# think of these strings like files; the letters like words
str1 = 'abcdefaaa'
str2 = 'abefdd'
str3 = 'defeee'

letters = {}
for fn, st in (('string 1', str1), ('string 2', str2), ('string 3', str3)):
    letters[fn] = {}
    for c in st:
        letters[fn][c] = letters[fn].setdefault(c, 0)
        letters[fn][c] += 1

print letters
# {'string 3': {'e': 4, 'd': 1, 'f': 1},
#  'string 1': {'a': 4, 'c': 1, 'b': 1, 'e': 1, 'd': 1, 'f': 1},
#  'string 2': {'a': 1, 'b': 1, 'e': 1, 'd': 2, 'f': 1}}
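If you want Perl-style autovivification (where $wcCount{$file}{$word}++ creates the nested entries on first use), a nested defaultdict comes closest. A sketch using the same strings:

from collections import defaultdict

# the inner dict is created automatically the first time a file name is seen
letters = defaultdict(lambda: defaultdict(int))
for fn, st in (('string 1', str1), ('string 2', str2), ('string 3', str3)):
    for c in st:
        letters[fn][c] += 1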
Answer 1 (score: 1)
You could use a Counter with a (filename, word) tuple as the key, e.g.:
from collections import Counter
from itertools import chain

word_counts = Counter()
for filename in ['your', 'file names', 'here']:
    with open(filename) as fin:
        words = chain.from_iterable(line.split() for line in fin)
        word_counts.update((filename, word) for word in words)
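With this layout the key is the (filename, word) tuple itself; for example, with a hypothetical file name:

print word_counts[('hamlet.txt', 'the')]  # Counter returns 0 for unseen keys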
However, you could also create an initial dict keyed on the file names, with a Counter for each, and then update those, so that you can access the "hash" with the file name as the key and then the word counts, e.g.:
word_counts = {filename: Counter() for filename in your_filenames}
for filename, counter in word_counts.items():
    with open(filename) as fin:
        words = chain.from_iterable(line.split() for line in fin)
        word_counts[filename].update(words)
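With that in place the lookup takes two steps, file name first; again with a hypothetical file name:

print word_counts['hamlet.txt']['the']  # 0 for words the file never contains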
Answer 2 (score: 0)
If you're using Python 2.7 or newer, I'd suggest collections.Counter:
import collections

counter = collections.Counter()
for line in fh:
    arr = line.split()
    for word in arr:
        key = filename + word  # creates a unique identifier for each word count
        counter.update((key,))
You can inspect the counts like this:
for key, value in counter.items():
    print('{0}: {1}'.format(key, value))
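Counter also provides most_common if you only care about the top entries; for example:

# ten most frequent keys, highest count first
for key, value in counter.most_common(10):
    print('{0}: {1}'.format(key, value))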
Answer 3 (score: 0)
I'm not a Perl programmer, but I believe the following Python solution will get you closest to Perl's $wcCount{$file}{$word}.
from collections import Counter
from itertools import chain

def count_words(filename):
    with open(filename, 'r') as f:
        word_iter = chain.from_iterable(line.split() for line in f)
        return Counter(word_iter)

word_counts = {file_name: count_words(file_name) for file_name in file_names}
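The result is a plain dict of Counters, so lookups mirror the Perl hash; for example, with a hypothetical 'hamlet.txt' among the file names:

print word_counts['hamlet.txt']['the']          # count for one word (0 if absent)
print word_counts['hamlet.txt'].most_common(5)  # five most frequent words in that file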
Answer 4 (score: 0)
Alternatively, you might benefit from getting to know nltk (the Natural Language Toolkit). If you end up doing more than just word frequencies, it can be a big help.
Here it tokenizes the text into sentences, then into words:
import nltk
import urllib2

hamlet = urllib2.urlopen('http://pastebin.com/raw.php?i=7p3uycAz').read().lower()

word_freq = nltk.FreqDist()
for sentence in nltk.sent_tokenize(hamlet):
    for word in nltk.word_tokenize(sentence):
        word_freq[word] += 1
word_freq:

FreqDist({',': 3269, '.': 1283, 'the': 1138, 'and': 965, 'to': 737, 'of': 669, 'i': 629, ';': 582, 'you': 553, ':': 535, ...})
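In recent NLTK versions FreqDist is a Counter subclass, so the familiar Counter helpers are available; for example:

print word_freq.most_common(10)  # ten most frequent tokens, punctuation included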