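I wrote some code to count word frequencies across multiple text files and store them in a dictionary.
I have been struggling to find a way to keep a running total per file for each word, like this: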
word1 [1] [20] [30] [22]
word2 [5] [7] [0] [4]
I have tried using a Counter, but I still can't find a suitable method or data structure.
import string
from collections import defaultdict
from collections import Counter
import glob
import os
# Words to remove
noise_words_set = {'the', 'to', 'of', 'a', 'in', 'is'}  # ...etc...
# Find files
path = r"C:\Users\Logs"
os.chdir(path)
print("Processing files...")
for file in glob.glob("*.txt"):
    # Read file
    with open(os.path.join(path, file), 'r', encoding="utf8") as f:
        txt = f.read()
    # Remove punctuation
    for punct in string.punctuation:
        txt = txt.replace(punct, "")
    # Split into words and make lower case
    words = [item.lower() for item in txt.split()]
    # Remove uninteresting words
    words = [w for w in words if w not in noise_words_set]
    # Make a dictionary of words
    D = defaultdict(int)
    for word in words:
        D[word] += 1
    # Add to some data structure (?) that keeps count per file
    #...word1 [1] [20] [30] [22]
    #...word2 [5] [7] [0] [4]
Answer 0 (score: 2)
You can use almost the whole structure you already have!
import glob
import string
from collections import Counter
files = dict()  # this may be better as a list, tbh
table = str.maketrans('', '', string.punctuation)
for file in glob.glob("*.txt"):
    with open(file, encoding="utf8") as f:
        word_count = Counter()
        for line in f:
            # Strip punctuation, split the line into words, then lower-case them
            words = [word.lower() for word in line.translate(table).split()]
            # Count everything that is not a noise word
            word_count.update(w for w in words if w not in noise_words_set)
    files[file] = word_count  # if list: files.append(word_count)
If you want to convert them into a dictionary keyed by word, do that afterwards:
words_count = dict()
for word_count in files.values():  # iterate over the per-file Counters, not the filenames
    for word, value in word_count.items():
        try:
            words_count[word].append(value)
        except KeyError:
            words_count[word] = [value]
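One caveat with the loop above: a count is appended only for files in which the word actually occurs, so the lists do not line up by file and the zeros from the question's example (word2 [5] [7] [0] [4]) never appear. A minimal sketch of one way to get aligned rows, assuming a `files` dict of Counters like the one built above (the file names and counts here are made up for illustration):

```python
from collections import Counter

# Hypothetical stand-in for the `files` dict built in the answer above.
files = {
    "a.txt": Counter({"error": 3, "disk": 1}),
    "b.txt": Counter({"error": 2}),
    "c.txt": Counter({"disk": 4, "network": 5}),
}

# Fix the column order once so every row uses the same file order.
filenames = sorted(files)

# Every word seen in any file (iterating a Counter yields its keys).
all_words = set().union(*files.values())

# A Counter returns 0 for a missing key, so absent words become zeros.
words_count = {
    word: [files[name][word] for name in filenames]
    for word in sorted(all_words)
}

for word, counts in words_count.items():
    print(word, counts)
```

Each list then has exactly one entry per file, in the order given by `filenames`.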
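Answer 1 (score: 2)
You should definitely rebuild this as a class. That would let you keep the items you need as shared state (i.e., you could have a method that runs on a single file and adds its counts to that state).
That said, I would build a defaultdict whose values are dicts (in the code below, defaultdicts of int, so the counts can be incremented directly).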
defaultdict(dict)
I would build it with the following layout (storing the total count and the per-file counts in the same data structure):
{word1:{filename1:5, filename2:20, total:25}, word2:{filename1:10, filename2:13, total:23}, ...}
To build this, you need to move the defaultdict call outside the for loop over the files. I went ahead and restructured the code for you:
import string
from collections import defaultdict
from collections import Counter
import glob
import os
# Words to remove
noise_words_set = {'the', 'to', 'of', 'a', 'in', 'is'}  # ...etc...
# Find files
path = r"C:\Users\Logs"
os.chdir(path)
print("Processing files...")
# global defaultdict
D = defaultdict(lambda: defaultdict(int))
# global counter (for file #)
counter = 1
for file in glob.glob("*.txt"):
    # create name for file number
    file_number = "file{number}".format(number=counter)
    # Read file
    with open(os.path.join(path, file), 'r', encoding="utf8") as f:
        txt = f.read()
    # Remove punctuation
    for punct in string.punctuation:
        txt = txt.replace(punct, "")
    # Split into words and make lower case
    words = [item.lower() for item in txt.split()]
    # Remove uninteresting words
    words = [w for w in words if w not in noise_words_set]
    # Make a dictionary of words
    for word in words:
        # add count to the file and the total
        D[word][file_number] += 1
        D[word]["total"] += 1
    counter += 1
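The class-based restructuring suggested at the top of this answer might look roughly like the sketch below. The class name `WordFrequencyIndex` and its methods are hypothetical, invented here for illustration; only the nested `defaultdict` layout and the "total" key come from the answer itself.

```python
import string
from collections import defaultdict


class WordFrequencyIndex:
    """Hypothetical helper that accumulates per-file and total word counts."""

    def __init__(self, noise_words=()):
        self.noise_words = set(noise_words)
        # word -> {label: count, "total": count}, as in the layout above
        self.counts = defaultdict(lambda: defaultdict(int))
        self._table = str.maketrans('', '', string.punctuation)

    def add_text(self, label, text):
        """Count the words of one document under the given label."""
        for word in text.translate(self._table).lower().split():
            if word in self.noise_words:
                continue
            self.counts[word][label] += 1
            self.counts[word]["total"] += 1

    def add_file(self, path, encoding="utf8"):
        """Read one file and count its words, using the path as the label."""
        with open(path, encoding=encoding) as f:
            self.add_text(path, f.read())


# Usage with made-up in-memory text instead of real log files:
index = WordFrequencyIndex(noise_words={"the", "to", "of"})
index.add_text("file1", "The disk error: disk full.")
index.add_text("file2", "Error to report")
print(dict(index.counts["disk"]))
print(dict(index.counts["error"]))
```

Because "total" shares a dict with the file labels, a file that happens to be labeled "total" would collide with it; keeping the totals in a separate dict would avoid that.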
Answer 2 (score: 1)
I hope this helps:
wordRef = defaultdict(lambda: defaultdict(int))
# ... some code ...
for file in glob.glob("*.txt"):
    # ... some code ...
    for word in words:
        wordRef[word][file] += 1
Answer 3 (score: 1)
D = defaultdict(lambda: defaultdict(int))
for file in glob.glob("*.txt"):
    # ...your code...
    for word in words:
        D[word][file] += 1
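To get from this nested dictionary to the rows shown in the question (word1 [1] [20] [30] [22]), one possible approach is to fix a file order and look up each count with a default of 0. This is only a sketch: the `sample` data below is invented and stands in for the words read from the real files.

```python
from collections import defaultdict

D = defaultdict(lambda: defaultdict(int))

# Hypothetical word lists standing in for the cleaned contents of each file.
sample = {
    "a.txt": ["error", "disk", "error"],
    "b.txt": ["disk"],
    "c.txt": ["error", "network"],
}
for file, words in sample.items():
    for word in words:
        D[word][file] += 1

# Use one fixed column order for every row.
file_names = sorted(sample)

rows = []
for word in sorted(D):
    # .get(name, 0) reads a count without inserting new keys into the inner dict
    counts = [D[word].get(name, 0) for name in file_names]
    rows.append(word + " " + " ".join("[{}]".format(c) for c in counts))

print("\n".join(rows))
```

Using `.get` rather than indexing keeps the inner defaultdicts from growing zero-valued entries for files in which a word never occurs.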