Question

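I want to find the frequencies of a list of reserved words across multiple .txt files and get the result as a pandas DataFrame. I am using a collections.Counter() object, and if a word does not appear in a text, the value for that word (key) in the Counter() should be zero.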

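Ideally, the result is a DataFrame in which each row corresponds to one .txt file, the column headers correspond to the reserved words, and the entry in row i, column j is the frequency of the j-th word in the i-th .txt file.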

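Here is my code. The problem is that the Counter() objects are not appended, in the sense of a dictionary with multiple values per key (reserved word), but are summed up instead: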

for filepath in iglob(os.path.join(folder_path, '*.txt')):
    with open(filepath) as file:
        cnt = Counter()
        tokens = re.findall(r'\w+', file.read().lower())
        for word in tokens:
            if word in mylist:
                cnt[word] += 1
            for key in mylist:
                if key not in cnt:
                    cnt[key] = 0
        dictionary = defaultdict(list)
        for key, value in cnt.items():
            dictionary[key].append(value)
    print(dictionary)

Any hints would be greatly appreciated!

Answer 1

You need to create the dictionary for the DataFrame before the loop, and then append the Counter values of each text file to it.

#!/usr/bin/env python3
import os
import re
from collections import Counter
from glob import iglob


def main():
    folder_path = '...'
    keywords = ['spam', 'ham', 'parrot']

    keyword2counts = {keyword: list() for keyword in keywords}
    for filename in iglob(os.path.join(folder_path, '*.txt')):
        with open(filename) as file:
            words = re.findall(r'\w+', file.read().lower())

        keyword2count = Counter(word for word in words if word in keywords)

        for keyword in keywords:
            keyword2counts[keyword].append(keyword2count[keyword])

    print(keyword2counts)


if __name__ == '__main__':
    main()
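
The code above stops at printing keyword2counts. To get to the DataFrame the question asks for, one option is to pass that dictionary to pandas.DataFrame. This is a minimal sketch, assuming pandas is installed; the counts are made-up sample data, and filenames is a hypothetical list that the loop would have to collect alongside the counts.

```python
import pandas as pd

# Made-up sample data standing in for the result of the loop above:
# one list entry per file, in the order the files were read.
keyword2counts = {'spam': [2, 0], 'ham': [1, 3], 'parrot': [0, 0]}

# Hypothetical list of file names, collected in the same loop (assumption).
filenames = ['a.txt', 'b.txt']

# Each dictionary key becomes a column, each list position becomes a row.
frame = pd.DataFrame(keyword2counts, index=filenames)
print(frame)
```

With this layout, row i holds the counts for the i-th file and column j holds the counts for the j-th keyword.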

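Testing whether an item is in a list is significantly slower than testing whether it is in a set. So if this turns out to be too slow, you could use a set for keywords, or keep an additional set just for the membership test.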
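
As a rough sketch of the second option, the list can be kept for the column order while a separate set handles the membership test. The name keyword_set and the token list are made up for illustration.

```python
from collections import Counter

keywords = ['spam', 'ham', 'parrot']
# Extra set used only for the membership test; the list keeps the column order.
keyword_set = set(keywords)

words = ['spam', 'eggs', 'spam', 'ham']  # made-up tokens from one file
keyword2count = Counter(word for word in words if word in keyword_set)

# Reading a missing key from a Counter yields 0, so no zero-filling is needed.
counts = [keyword2count[keyword] for keyword in keywords]
print(counts)
```

Reading a missing key this way does not insert it into the Counter, so the Counter itself still only holds the words that were actually seen.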

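If the order of the columns matters, there is also collections.OrderedDict for Python versions before 3.7 (or CPython 3.6), where a plain dict does not guarantee insertion order.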
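
A small sketch of that variant, assuming the same keywords list as above; only the construction of keyword2counts changes, and the appended value is made up.

```python
from collections import OrderedDict

keywords = ['spam', 'ham', 'parrot']

# Keys are inserted in keyword order, and OrderedDict preserves that order
# on every Python version.
keyword2counts = OrderedDict((keyword, []) for keyword in keywords)

keyword2counts['ham'].append(3)  # made-up count for one file
print(list(keyword2counts))
```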

Append multiple Counter() objects and convert to a DataFrame
