Accessing the most frequent sub-value from a multi-value dictionary

Time: 2014-08-22 11:01:49

Tags: python python-2.7 dictionary

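I'm writing a script that finds all the text files in a directory, then finds the number of lines and the most common word in each file. I know this isn't the simplest or best way to do it, but I'm very new to Python (two weeks).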

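One small problem I've run into is that I have two main dictionaries. One stores each file and its line count; the other stores each file, its line count, and its most common word together with that word's frequency, like so: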

dict1_example = {'file':'lines'}
dict2_example = {'file': ('lines', ('word', 'count'))}
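
As a concrete illustration, the script below stores `lines, word_count(file)` for each file, so every value is a nested tuple. The following sketch uses made-up file names and counts (all values here are hypothetical) to show that shape and how the inner pair can be reached by indexing or unpacking:

```python
# Hypothetical sample shaped like file_info in the script:
# file name -> (lines, (word, count))
dict2_example = {
    'a.txt': (10, ('PYTHON', 4)),
    'b.txt': (25, ('DICT', 7)),
}

# Index into the value, then take the inner (word, count) pair.
word, count = dict2_example['b.txt'][1]

# Or unpack the whole value in one step.
lines, (word2, count2) = dict2_example['a.txt']
```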

I'd like to be able to pull out the most common word across all the files, i.e. access the ('word', 'count') part of the second dictionary.

Is there a way to get the information from that part, or do I need to use these functions and create an additional dictionary with {'word': 'count'}?

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import glob
import os
from sys import argv
import re
from collections import Counter

script, directory = argv

def file_len2(filename2):
    with open(filename2) as f2:
        l2 = [x for x in f2.readlines() if x != "\n"]
    return len(l2)

def word_count(filename3):
    with open(filename3) as f3:
        passage = f3.read()

    stop_words = ("THE", "OF", "A", "TO", "AND", "IS", "IN", "YOU", "THAT", "IT", "THIS", "YOUR", "AS", "AN", "BUT", "FOR")
    words = re.findall(r'\w+', passage)
    cap_words = [word.upper() for word in words if word.upper() not in stop_words]
    word_counts = Counter(cap_words)
    # Look up the most frequent word once, then return it with its count.
    top_word = max(word_counts, key=word_counts.get)
    return top_word, word_counts[top_word]



files = glob.glob(directory + "/*.txt")


length = {}
file_info = {}

for file in files:
    lines = file_len2(file)
    length[file] = lines
    file_info[file] = lines, word_count(file)


for file, lines in length.iteritems():
    print '{}: {}'.format(os.path.basename(file), lines), word_count(file)




maximum_file = max(length, key=length.get)
minimum_file = min(length, key=length.get)

# Reuse the file names found above instead of recomputing max/min.
maximum_lines = os.path.basename(maximum_file)
minimum_lines = os.path.basename(minimum_file)


print "The file with the maximum number of lines:" 
print "%r lines in %r " % (length[maximum_file], maximum_lines)

print "The file with the minimum number of lines:" 
print "%r lines in %r" % (length[minimum_file], minimum_lines)

sum_lines = sum(length.values())
number_of_values = len(length)

average = sum_lines / number_of_values

print "The average number of lines in a text file in given directory: ", average, "- Rounded down"

1 answer:

Answer 0 (score: 0)

I seem to have solved my problem by making another dictionary:

word_freq[file] = word_count(file)

and swapping the order of the return values, so that the count comes before the word, in

def word_count(filename3)
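
A minimal sketch of what that swap might look like, assuming the rest of `word_count` stays as in the question. Two things here are assumptions rather than the original code: `Counter.most_common(1)` replaces the `max` lookup, and the `(0, None)` result for a file with no countable words is a made-up fallback.

```python
import re
from collections import Counter

STOP_WORDS = ("THE", "OF", "A", "TO", "AND", "IS", "IN", "YOU", "THAT",
              "IT", "THIS", "YOUR", "AS", "AN", "BUT", "FOR")

def word_count(filename3):
    """Return (count, word) for the most frequent non-stop word in a file."""
    with open(filename3) as f3:
        passage = f3.read()

    words = re.findall(r'\w+', passage)
    cap_words = [w.upper() for w in words if w.upper() not in STOP_WORDS]

    top = Counter(cap_words).most_common(1)
    if not top:
        return (0, None)  # assumption: nothing left to count in this file

    word, count = top[0]
    return (count, word)  # count first, so tuples compare by frequency
```

With the count in first position, comparing two return values compares the frequencies before the words.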

Then I used this to get the most common word:

print word_freq[max(word_freq, key=word_freq.get)]
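
To make the lookup concrete, here is a small sketch with hypothetical per-file results (the file names and counts are invented). It assumes each value in `word_freq` is a `(count, word)` tuple, and it uses `print()` with a single argument so it runs on both Python 2.7 and Python 3:

```python
# Hypothetical data shaped like word_freq: file name -> (count, word)
word_freq = {
    'a.txt': (4, 'PYTHON'),
    'b.txt': (7, 'DICT'),
    'c.txt': (2, 'TUPLE'),
}

# Tuples compare element by element, so the highest count wins.
top_file = max(word_freq, key=word_freq.get)
top_count, top_word = word_freq[top_file]

print('{!r} appears {} times in {}'.format(top_word, top_count, top_file))
```

Note that this picks the largest per-file maximum. Ranking words over the combined text of all files would be a different computation, for example summing one `Counter` per file.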