I have to store word counts for each word across multiple files. In Perl I used a hash of hashes, e.g. $wcCount{$file}{$word}. I can't figure out how to do anything similar in Python. I tried something to this effect, but it obviously doesn't work:
for line in fh:
    arr = line.split()
    for word in arr:
        key = filename + word  # creates a unique identifier for each word count
        freqdict[key] += 1     # fails with a KeyError the first time a key is seen
I read another StackOverflow answer to a similar question, but it didn't allow the value to be updated when a word was counted again. The input is the words of several files (passed as command-line arguments), and the output should simply be a frequency list of the words in each file.
Answer 0 (score: 2)
Suppose you have Hamlet and you want to count its unique words. You could do it like this:
# the tools we need: url reading and the regex library
import urllib2
import re

# a dict -- similar to a Perl hash
words = {}

# read the text at that url
response = urllib2.urlopen('http://pastebin.com/raw.php?i=7p3uycAz')
hamlet = response.read()

# split on whitespace, remove trailing punctuation, and count each unique word
for word in hamlet.split():
    word = re.sub(r'\W+$', r'', word)
    if word.strip():
        words[word] = words.setdefault(word, 0) + 1
Then, if you want to print the words from most common to least common:
# sort descending on count, ascending on ascii lower case
for word, count in sorted(words.items(), key=lambda t: (-t[1], t[0].lower())):
    print word, count
Prints:
the 988
and 702
of 628
to 610
I 541
you 495
a 452
my 441
in 399
HAMLET 385
it 360
is 313
...
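As an aside, collections.defaultdict gives the same counts without the setdefault call (and sidesteps the KeyError from the question's snippet). A minimal sketch, reusing hamlet and re from above:

from collections import defaultdict

# missing keys start at int() == 0, so += 1 just works
words = defaultdict(int)
for word in hamlet.split():
    word = re.sub(r'\W+$', r'', word)
    if word.strip():
        words[word] += 1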
If you want a dict of nested dicts (as your Perl example suggests), you might do:
# think of these strings like files; the letters like words
str1 = 'abcdefaaa'
str2 = 'abefdd'
str3 = 'defeee'

letters = {}
for fn, st in (('string 1', str1), ('string 2', str2), ('string 3', str3)):
    letters[fn] = {}
    for c in st:
        letters[fn][c] = letters[fn].setdefault(c, 0)
        letters[fn][c] += 1

print letters
# {'string 3': {'e': 4, 'd': 1, 'f': 1},
#  'string 1': {'a': 4, 'c': 1, 'b': 1, 'e': 1, 'd': 1, 'f': 1},
#  'string 2': {'a': 1, 'b': 1, 'e': 1, 'd': 2, 'f': 1}}
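If you want Perl-style autovivification (where $wcCount{$file}{$word}++ creates the nested entries on first use), a nested defaultdict comes closest. A sketch using the same strings:

from collections import defaultdict

# the inner dict is created automatically the first time a file name is seen
letters = defaultdict(lambda: defaultdict(int))
for fn, st in (('string 1', str1), ('string 2', str2), ('string 3', str3)):
    for c in st:
        letters[fn][c] += 1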
Answer 1 (score: 1)
You could use a Counter with a (filename, word) tuple as the key, e.g.:
from collections import Counter
from itertools import chain

word_counts = Counter()
for filename in ['your', 'file names', 'here']:
    with open(filename) as fin:
        words = chain.from_iterable(line.split() for line in fin)
        word_counts.update((filename, word) for word in words)
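With this layout the key is the (filename, word) tuple itself; for example, with a hypothetical file name:

print word_counts[('hamlet.txt', 'the')]  # Counter returns 0 for unseen keys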
However, you could also create an initial dict keyed on the file names, with a Counter for each, and then update those, so that you can access the "hash" with the file name as the key and then the word counts, e.g.:
word_counts = {filename: Counter() for filename in your_filenames}
for filename, counter in word_counts.items():
    with open(filename) as fin:
        words = chain.from_iterable(line.split() for line in fin)
        word_counts[filename].update(words)
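With that in place the lookup takes two steps, file name first; again with a hypothetical file name:

print word_counts['hamlet.txt']['the']  # 0 for words the file never contains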
Answer 2 (score: 0)
If you're using Python 2.7 or newer, I'd suggest collections.Counter:
import collections

counter = collections.Counter()
for line in fh:
    arr = line.split()
    for word in arr:
        key = filename + word  # creates a unique identifier for each word count
        counter.update((key,))
You can inspect the counts like this:
for key, value in counter.items():
    print('{0}: {1}'.format(key, value))
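Counter also provides most_common if you only care about the top entries; for example:

# ten most frequent keys, highest count first
for key, value in counter.most_common(10):
    print('{0}: {1}'.format(key, value))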
Answer 3 (score: 0)
I'm not a Perl programmer, but I believe the following Python solution will get you closest to Perl's $wcCount{$file}{$word}.
from collections import Counter
from itertools import chain

def count_words(filename):
    with open(filename, 'r') as f:
        word_iter = chain.from_iterable(line.split() for line in f)
        return Counter(word_iter)

word_counts = {file_name: count_words(file_name) for file_name in file_names}
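The result is a plain dict of Counters, so lookups mirror the Perl hash; for example, with a hypothetical 'hamlet.txt' among the file names:

print word_counts['hamlet.txt']['the']          # count for one word (0 if absent)
print word_counts['hamlet.txt'].most_common(5)  # five most frequent words in that file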
Answer 4 (score: 0)
Alternatively, you might benefit from getting to know nltk (the Natural Language Toolkit). If you end up doing more than just word frequencies, it can be a big help.
Here it tokenizes the text into sentences, then into words:
import nltk
import urllib2

hamlet = urllib2.urlopen('http://pastebin.com/raw.php?i=7p3uycAz').read().lower()

word_freq = nltk.FreqDist()
for sentence in nltk.sent_tokenize(hamlet):
    for word in nltk.word_tokenize(sentence):
        word_freq[word] += 1
word_freq:

FreqDist({',': 3269, '.': 1283, 'the': 1138, 'and': 965, 'to': 737, 'of': 669, 'i': 629, ';': 582, 'you': 553, ':': 535, ...})
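In recent NLTK versions FreqDist is a Counter subclass, so the familiar Counter helpers are available; for example:

print word_freq.most_common(10)  # ten most frequent tokens, punctuation included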