I'm trying to build a dictionary of words from a text file, count the instances of each word, and then be able to search the dictionary for a word and get its count, but I'm stuck. My biggest problem is lowercasing the words from the text file and stripping their punctuation; otherwise my counts will be off. Any suggestions?
f=open("C:\Users\Mark\Desktop\jefferson.txt","r")
wc={}
words = f.read().split()
count = 0
i = 0
for line in f:
    count += len(line.split())
for w in words:
    if i < count:
        words[i].translate(None, string.punctuation).lower()
        i += 1
    else:
        i += 1
        print words
for w in words:
    if w not in wc:
        wc[w] = 1
    else:
        wc[w] += 1
print wc['states']
Answer 0 (score: 1)
A few points:

First, in Python, always use the following construct to read a file:
with open('ls;df', 'r') as f:
    # rest of the statements
Second, if you use f.read().split(), you will have read to the end of the file. After that, you need to go back to the beginning:
f.seek(0)
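Put together, those two points look like this. This is a minimal sketch, not the asker's code; io.StringIO stands in for the real file so it runs on its own, and the sample sentence is illustrative:

```python
import io

# io.StringIO stands in for open('jefferson.txt') in this sketch
f = io.StringIO("the united states of america")

words = f.read().split()   # read() consumes the whole stream...
assert f.read() == ""      # ...so the file pointer is now at the end

f.seek(0)                  # rewind before iterating over lines again
count = sum(len(line.split()) for line in f)
assert count == len(words) == 5
```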
Third, in the part where you have:
for w in words:
    if i < count:
        words[i].translate(None, string.punctuation).lower()
        i += 1
    else:
        i += 1
        print words
you don't need to keep a counter yourself in Python. You can simply do:
for i, w in enumerate(words):
    if i < count:
        words[i].translate(None, string.punctuation).lower()
    else:
        print words
However, you don't even need the i < count check here... you can simply do:
words = [w.translate(None, string.punctuation).lower() for w in words]
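Note that the two-argument form str.translate(None, string.punctuation) only works on Python 2 strings; on Python 3 the equivalent is built with str.maketrans. A sketch of the same cleanup step (the sample words are illustrative, not from the question's file):

```python
import string

# Python 3 equivalent of w.translate(None, string.punctuation).lower():
# a translation table whose third argument lists characters to delete
table = str.maketrans("", "", string.punctuation)

words = ["Hello,", "World!", "it's", "FINE."]   # illustrative sample
cleaned = [w.translate(table).lower() for w in words]
assert cleaned == ["hello", "world", "its", "fine"]
```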
Finally, if you only want to count states rather than build a dictionary of every word, consider using filter:
print len(filter( lambda m: m == 'states', words ))
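One caveat: len(filter(...)) works on Python 2, where filter returns a list, but raises TypeError on Python 3, where it returns an iterator. list.count, or sum over a generator, does the same job on both (the word list here is illustrative):

```python
words = ["the", "united", "states", "of", "the", "states"]  # illustrative

# Portable equivalents of len(filter(lambda m: m == 'states', words)):
assert words.count('states') == 2
assert sum(1 for w in words if w == 'states') == 2
```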
One last thing... If the file is huge, it is not advisable to pull every word into memory at once. Consider updating the wc dictionary line by line. You could do:
for line in f:
    words = line.split()
    # rest of your code
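Putting this answer's points together, a runnable sketch of the line-by-line approach; this assumes Python 3 (str.maketrans replaces the Python-2-only translate call) and io.StringIO stands in for the real file:

```python
import io
import string

table = str.maketrans("", "", string.punctuation)
wc = {}

# io.StringIO stands in for open('jefferson.txt') in this sketch
f = io.StringIO("The united states;\nthe states remain.\n")
for line in f:
    for w in line.split():
        w = w.translate(table).lower()   # strip punctuation, lowercase
        wc[w] = wc.get(w, 0) + 1         # count without pre-seeding keys

assert wc['states'] == 2 and wc['the'] == 2
```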
Answer 1 (score: 1)
This sounds like a job for collections.Counter:
import collections

with open('gettysburg.txt') as f:
    c = collections.Counter(f.read().split())

print "'Four' appears %d times"%c['Four']
print "'the' appears %d times"%c['the']
print "There are %d total words"%sum(c.values())
print "The 5 most common words are", c.most_common(5)
Result:
$ python foo.py
'Four' appears 1 times
'the' appears 9 times
There are 267 total words
The 5 most common words are [('that', 10), ('the', 9), ('to', 8), ('we', 8), ('a', 7)]
Of course, this counts "Liberty," and "this." as words (note the punctuation attached to them). It also treats "The" and "the" as different words. And reading the whole file into memory at once can be a problem for very large files.
Here is a version that ignores punctuation and case, and is more memory-efficient for large files:
import collections
import re

with open('gettysburg.txt') as f:
    c = collections.Counter(
        word.lower()
        for line in f
        for word in re.findall(r'\b[^\W\d_]+\b', line))

print "'Four' appears %d times"%c['Four']
print "'the' appears %d times"%c['the']
print "There are %d total words"%sum(c.values())
print "The 5 most common words are", c.most_common(5)
Result:
$ python foo.py
'Four' appears 0 times
'the' appears 11 times
There are 271 total words
The 5 most common words are [('that', 13), ('the', 11), ('we', 10), ('to', 8), ('here', 8)]
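The regex in that version keeps only runs of letters: [^\W\d_] means "word character, minus digits and underscore". A quick check of how it tokenizes (the sample line is illustrative, not from the file):

```python
import re

line = "Four score and 7 years ago -- our fathers' fathers."  # illustrative
tokens = [w.lower() for w in re.findall(r"\b[^\W\d_]+\b", line)]

# digits and punctuation are dropped; "fathers'" yields just "fathers"
assert tokens == ['four', 'score', 'and', 'years', 'ago',
                  'our', 'fathers', 'fathers']
```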
Answer 2 (score: 0)
File_Name = 'file.txt'
counterDict = {}

with open(File_Name, 'r') as fh:
    for line in fh:
        # removing their punctuation
        words = line.replace('.','').replace('\'','').replace(',','').lower().split()
        for word in words:
            if word not in counterDict:
                counterDict[word] = 1
            else:
                counterDict[word] = counterDict[word] + 1

print('Count of the word > common< :: ', counterDict.get('common',0))
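As an aside, the if/else bookkeeping above can be collapsed with dict.get, which returns a default when the key is missing; a minimal sketch with illustrative input:

```python
counterDict = {}
for word in "a common word a common".split():   # illustrative input
    # get(word, 0) yields 0 the first time a word is seen
    counterDict[word] = counterDict.get(word, 0) + 1

assert counterDict['common'] == 2
assert counterDict.get('missing', 0) == 0
```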