Dictionary of lists - tracking word frequency per file

Date: 2014-03-19 22:45:34

Tags: python data-structures dictionary

I've written some code that counts word frequencies across multiple text files and stores the counts in a dictionary.

I've been struggling to find a way to keep a running total per file for each word, like this:

word1 [1] [20] [30] [22]
word2 [5] [7] [0] [4]

I've tried using Counter, but I still can't find the right approach/data structure.

import string 
from collections import defaultdict
from collections import Counter
import glob
import os


# Words to remove
noise_words_set = {'the','to','of','a','in','is',...etc...}


# Find files
path = r"C:\Users\Logs"
os.chdir(path)
print("Processing files...")
for file in glob.glob("*.txt"):

    # Read file
    txt = open(os.path.join(path, file), 'r', encoding="utf8").read()

    # Remove punctuation
    for punct in string.punctuation:
        txt = txt.replace(punct,"")

    # Split into words and make lower case
    words = [item.lower() for item in txt.split()]

    # Remove uninteresting words
    words = [w for w in words if w not in noise_words_set]

    # Make a dictionary of words
    D = defaultdict(int)
    for word in words:
        D[word] += 1

    # Add to some data structure (?) that keeps count per file
    #...word1 [1] [20] [30] [22]
    #...word2 [5] [7] [0] [4]

4 Answers:

Answer 0 (score: 2)

You can keep almost your entire structure!

import string
import glob
from collections import Counter

# noise_words_set as defined in your question

files = dict() # this may be better as a list, tbh

table = str.maketrans('', '', string.punctuation)

for file in glob.glob("*.txt"):
    with open(file, encoding="utf8") as f:
        word_count = Counter()
        for line in f:
            word_count += Counter([word.lower() for word in line.translate(table).split()
                                   if word.lower() not in noise_words_set])
    files[file] = word_count # if list: files.append(word_count)

If you then want to collapse these into a single dictionary of lists, do it afterwards:

words_count = dict()
for file, counts in files.items():
    for word, value in counts.items():
        try: words_count[word].append(value)
        except KeyError: words_count[word] = [value]
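Note that the append-based conversion above only records a value for files where the word actually appears, so the lists for different words can end up misaligned. Since Counter returns 0 for missing keys, you can zero-fill instead. A minimal sketch, with made-up filenames and counts standing in for the `files` dict built above:

```python
from collections import Counter

# Hypothetical per-file Counters, standing in for the `files` dict above
files = {
    "a.txt": Counter({"word1": 1, "word2": 5}),
    "b.txt": Counter({"word1": 20, "word2": 7}),
    "c.txt": Counter({"word1": 30}),           # word2 absent -> counts as 0
    "d.txt": Counter({"word1": 22, "word2": 4}),
}

all_words = set().union(*files.values())
file_order = sorted(files)  # fix an ordering so the lists line up

# Counter returns 0 for missing keys, so every list gets one slot per file
words_count = {w: [files[f][w] for f in file_order] for w in all_words}

print(words_count["word1"])  # [1, 20, 30, 22]
print(words_count["word2"])  # [5, 7, 0, 4]
```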

Answer 1 (score: 2)

You should definitely restructure this as a class. That would let you store the items you need as object state (i.e., you could have a function that runs on a single file and adds its counts in).

That said, I would build a defaultdict containing dicts:

defaultdict(dict)

I would build it with the following layout (storing the total count and the individual file counts in the same data structure):

{word1: {filename1: 5, filename2: 20, total: 25},
 word2: {filename1: 10, filename2: 13, total: 23},
 ...}

To build this, you need to move the defaultdict call outside the for loop over files. I've gone ahead and restructured your code for you:

import string 
from collections import defaultdict
from collections import Counter
import glob
import os


# Words to remove
noise_words_set = {'the','to','of','a','in','is',...etc...}


# Find files
path = r"C:\Users\Logs"
os.chdir(path)
print("Processing files...")

#global defaultdict
D = defaultdict(lambda: defaultdict(int))

#global counter (for file #)
counter = 1

for file in glob.glob("*.txt"):

    #create name for file number
    file_number = "file{number}".format(number=counter)

    # Read file
    txt = open(os.path.join(path, file), 'r', encoding="utf8").read()

    # Remove punctuation
    for punct in string.punctuation:
        txt = txt.replace(punct,"")

    # Split into words and make lower case
    words = [item.lower() for item in txt.split()]

    # Remove uninteresting words
    words = [w for w in words if w not in noise_words_set]

    # Make a dictionary of words
    for word in words:
        #add count to the file and the total
        D[word][file_number] += 1
        D[word]["total"] += 1

    counter += 1
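With this layout, per-file counts and totals come out of the same structure. A minimal sketch of how you might query it, using made-up words and file numbers in place of the loop above:

```python
from collections import defaultdict

D = defaultdict(lambda: defaultdict(int))

# Hypothetical occurrences, standing in for the file-processing loop above
for file_number, word in [("file1", "apple"), ("file1", "apple"),
                          ("file2", "apple"), ("file2", "banana")]:
    D[word][file_number] += 1
    D[word]["total"] += 1

print(D["apple"]["file1"])   # 2
print(D["apple"]["total"])   # 3
print(D["banana"]["file1"])  # 0 (the inner defaultdict fills in missing files)

# Words ranked by overall frequency
ranked = sorted(D, key=lambda w: D[w]["total"], reverse=True)
print(ranked)  # ['apple', 'banana']
```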

Answer 2 (score: 1)

I hope this helps:

wordRef = defaultdict(lambda : defaultdict(int))

... some code ...

for file in glob.glob("*.txt"):

    ... some code ...

    for word in words:
        wordRef[word][file] += 1
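To get from this nested structure to the per-file rows shown in the question, you can read each word's count for every file in a fixed order; the inner defaultdict(int) returns 0 for files where a word never appeared. A minimal sketch with made-up filenames and counts:

```python
from collections import defaultdict

wordRef = defaultdict(lambda: defaultdict(int))

# Hypothetical counts, standing in for the glob loop above
wordRef["word1"]["a.txt"] = 1
wordRef["word1"]["b.txt"] = 20
wordRef["word2"]["a.txt"] = 5

file_order = ["a.txt", "b.txt"]  # fix an ordering so the rows line up
for word in sorted(wordRef):
    row = [wordRef[word][f] for f in file_order]  # missing files read as 0
    print(word, row)
# word1 [1, 20]
# word2 [5, 0]
```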

Answer 3 (score: 1)

D = defaultdict(lambda: defaultdict(int))
for file in glob.glob("*.txt"):
    ...your code...
    for word in words:
        D[word][file] += 1