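I am writing a script that finds all the text files in a directory, then finds the number of lines and the most common word in each file. I know this isn't the simplest or best way to do it, but I am very new to Python (2 weeks).
One small problem I've run into is that I have two main dictionaries. One stores each file and its line count; the other stores each file, its line count, and its most common word with that word's frequency, like so: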
dict1_example = {'file':'lines'}
dict2_example = {'file': ('lines', ('word', 'count'))}
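I would like to be able to extract the most common word across all files, i.e. access the ('word', 'count') part of the second dictionary.
Is there a way to get the information out of that part, or do I need to use these functions to create an additional dictionary of the form {'word': 'count'}?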
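For what it's worth, the nested part can be reached by plain indexing, so an extra dictionary is not strictly required. Below is a minimal sketch, assuming each value has the shape (lines, (word, count)), which is what the script's `file_info[file] = lines, word_count(file)` produces. The file names and numbers are made up for illustration, and the code is written to run on Python 3.

```python
# Hypothetical sample data in the shape the script builds:
# file_info[path] = (line_count, (top_word, top_count))
file_info = {
    "a.txt": (10, ("PYTHON", 4)),
    "b.txt": (25, ("DATA", 9)),
    "c.txt": (7, ("CODE", 2)),
}

# Index [1] is the (word, count) pair; unpack it into two names.
word, count = file_info["b.txt"][1]

# Pick the file whose top word has the highest count.
# [1][1] reaches the count inside the nested pair.
best_file = max(file_info, key=lambda name: file_info[name][1][1])
best_word, best_count = file_info[best_file][1]

print("{}: {} ({})".format(best_file, best_word, best_count))
```

The `key` function only tells `max` what to compare; the result is still a key of `file_info`, so the rest of the entry stays reachable.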
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import glob
import os
from sys import argv
import re
from collections import Counter
script, directory = argv
def file_len2(filename2):
    # Count the non-blank lines in the file.
    with open(filename2) as f2:
        l2 = [x for x in f2.readlines() if x != "\n"]
    return len(l2)
def word_count(filename3):
    # Return the most common non-stop word in the file and its count.
    with open(filename3) as f3:
        passage = f3.read()
    stop_words = ("THE", "OF", "A", "TO", "AND", "IS", "IN", "YOU", "THAT", "IT", "THIS", "YOUR", "AS", "AN", "BUT", "FOR")
    words = re.findall(r'\w+', passage)
    cap_words = [word.upper() for word in words if word.upper() not in stop_words]
    word_counts = Counter(cap_words)
    # Look up the top word once instead of calling max() twice.
    top_word = max(word_counts, key=word_counts.get)
    return top_word, word_counts[top_word]
files = glob.glob(directory + "/*.txt")
length = {}
file_info = {}
for file in files:
    lines = file_len2(file)
    length[file] = lines
    file_info[file] = lines, word_count(file)
for file, lines in length.iteritems():
    print '{}: {}'.format(os.path.basename(file), lines), word_count(file)
maximum_file = max(length, key=length.get)
minimum_file = min(length, key=length.get)
maximum_lines = os.path.basename(max(length, key=length.get))
minimum_lines = os.path.basename(min(length, key=length.get))
print "The file with the maximum number of lines:"
print "%r lines in %r " % (length[maximum_file], maximum_lines)
print "The file with the minimum number of lines:"
print "%r lines in %r" % (length[minimum_file], minimum_lines)
sum_lines = sum(length.values())
number_of_values = len(length)
average = sum_lines / number_of_values
print "The average number of lines in a text file in given directory: ", average, "- Rounded down"
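If "most common word across all files" means the word with the highest total count over every file, rather than the best per-file winner, one option is to add the per-file `Counter` objects together. This is only a sketch under that assumption: `count_words` and `most_common_overall` are hypothetical helper names, the sample strings stand in for file contents, and the code targets Python 3.

```python
import re
from collections import Counter

STOP_WORDS = {"THE", "OF", "A", "TO", "AND", "IS", "IN", "YOU", "THAT",
              "IT", "THIS", "YOUR", "AS", "AN", "BUT", "FOR"}

def count_words(text):
    """Return a Counter of upper-cased words in text, skipping stop words."""
    words = re.findall(r'\w+', text)
    return Counter(w.upper() for w in words if w.upper() not in STOP_WORDS)

def most_common_overall(texts):
    """Sum the per-text Counters; return (word, count), or None if no words."""
    total = Counter()
    for text in texts:
        total.update(count_words(text))  # update() adds counts together
    top = total.most_common(1)           # a list with at most one (word, count) pair
    return top[0] if top else None

# Hypothetical file contents standing in for the .txt files.
samples = ["the cat sat on the mat", "a cat and a dog", "the dog saw the cat"]
print(most_common_overall(samples))
```

Returning `None` when no words remain avoids the `ValueError` that `max()` raises on an empty sequence, which the original `word_count` would hit on an empty file.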
Answer 0 (score: 0)
I seem to have solved my problem by making another dictionary:
word_freq[file] = word_count(file)
and switching the order of the return values of def word_count(filename3).
Then I used this to get the most common word:
print word_freq[max(word_freq, key=word_freq.get)]
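The order of the return values probably matters because Python compares tuples element by element, so with (count, word) the count decides first. Here is a small sketch of that behaviour, assuming `word_count` now returns (count, word). The data is invented for illustration and the code is written for Python 3.

```python
# Hypothetical data: word_freq[path] = (count, word) after swapping the return order.
word_freq = {
    "a.txt": (4, "PYTHON"),
    "b.txt": (9, "DATA"),
    "c.txt": (2, "CODE"),
}

# Tuples compare left to right, so the count is compared before the word.
top_file = max(word_freq, key=word_freq.get)
top_count, top_word = word_freq[top_file]
print("{}: {} ({})".format(top_file, top_word, top_count))

# With the original (word, count) order, max() compares the words first
# and picks the alphabetically last word instead of the highest count.
by_word = {name: (word, count) for name, (count, word) in word_freq.items()}
wrong_file = max(by_word, key=by_word.get)
print(wrong_file)
```

When two files tie on count, the comparison falls through to the word, so the alphabetically later word wins.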