Question

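I'm writing code about the Zipf's law distribution. My task is to read 10 text files from one folder and output a table with four columns:
1. Rank (1, 2, 3, 4, ...) (r)
2. The words, ordered from highest to lowest frequency
3. The exact frequency of each word (f)
4. r * f

But I've run into three problems:
1. How can I read the data from all 10 text files (in one folder) into Python at once?
2. How can I use the data I analyzed to draw the table?
3. Is it possible to draw a plot after the table (using matplotlib)? How?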

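I tried analyzing the data from a single text file, but I can't find a way to analyze all 10 text files together. I also don't know how to use the analyzed data to draw a table in Python.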

import re
from operator import itemgetter

frequency = {}

open_file = open(r'C:\最后上机作业（12.12.2018）\english\e6.txt')

file_to_string = open_file.read()
words = re.findall(r'(\b[A-Za-z][a-z]{2,9}\b)',file_to_string)

for word in words:
    count = frequency.get(word,0)
    frequency[word] = count+1

for (key, value) in reversed(sorted(frequency.items(),key = itemgetter(1))):
    print(key,value)

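The output is the frequency of each word. But I'm looking for a way to aggregate the data from all 10 text files at once and use it to draw a table in Python. Here is the code I tried for drawing the table, but I messed up the data input: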

import matplotlib.pyplot as plt
import numpy as np

plt.figure()
ax = plt.gca()
y = np.random.randn(9)

col_labels = ['r','word','frequency','r*f']
row_labels = ['1','2','3','4','5'] #I am trying to arrange the data from highest frequencies to lowest frequencies, not only top 5 but all.Is it possible the code can arrange by itself?
table_vals = [[sorted(frequency.items(),key = itemgetter(1))],[21,22,23],[28,29,30]] #How to enter data I analyzed in table value?
row_colors = ['red','gold','green']
my_table = plt.table(cellText=table_vals, colWidths=[0.1]*3,
                     rowLabels=row_labels, colLabels=col_labels,
                     rowColours=row_colors, colColours=row_colors,
                     loc='best')
plt.plot(y)

plt.show()

This is what the table should look like

Answer 1

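An efficient way to count words is the Counter class from the collections library. See the sample code below, which reads all .txt files in the same location and counts the words.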

# This code reads all the text files in the current directory.
import glob
from collections import Counter

words = []
for path in glob.glob("*.txt"):  # or use a full path
    # "with" closes each file as soon as it has been read
    with open(path, "r") as f:
        for line in f:
            words.extend(line.split())

# Print all the words from all .txt files
print(words)
print("\nCounts..............")
# Count how often each word occurs
word_counts = Counter(words)
print(word_counts)

With a full path:

for file in glob.glob("C:/Users/Admin/Desktop/text/*.txt"):
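To combine this with the regular expression from the question, one option is to update a single Counter once per file instead of building a big list of words first. The following is only a sketch: the function name `count_words_in_folder` is made up for illustration, and UTF-8 is an assumed file encoding that may not match your files.

```python
import glob
import os
import re
from collections import Counter

# Same pattern as in the question: words of 3 to 10 letters,
# optionally starting with a capital letter.
WORD_RE = re.compile(r"\b[A-Za-z][a-z]{2,9}\b")


def count_words_in_folder(folder):
    """Return one Counter with the word counts of every .txt file in folder.

    Hypothetical helper, not part of any library.
    """
    counts = Counter()
    for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        # utf-8 is an assumption; errors="replace" avoids a crash on odd bytes
        with open(path, encoding="utf-8", errors="replace") as f:
            counts.update(WORD_RE.findall(f.read()))
    return counts
```

Calling `count_words_in_folder` with the folder that holds the 10 files should then give the combined frequencies in one object.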

Test file contents:
First text file:

first text file c
count
anything any word

Second text file:

second file and 
with texts 
dskfhj dsj fkjs

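Create a few text files in a directory, write the Python file in the same directory, run the code, check whether you get the expected result, and then modify the code accordingly.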

Result of the code above:

Counter({'file': 2, 'first': 1, 'text': 1, 'c': 1, 'count': 1, 'anything': 1, 'any': 1, 'word': 1, 'second': 1, 'and': 1, 'with': 1, 'texts': 1, 'dskfhj': 1, 'dsj': 1, 'fkjs': 1})

Reference: Counter

Finally, based on this output, all the Counter data can easily be put into a table.
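As a rough sketch of that last step, `Counter.most_common()` already returns the words sorted from highest to lowest frequency, so the rank is just the position in that list. The helper names `zipf_rows` and `show_zipf` and the sample counts are made up for illustration, and the matplotlib layout is only one possible choice.

```python
from collections import Counter


def zipf_rows(word_counts):
    """Return (rank, word, frequency, rank * frequency) tuples, most frequent first."""
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(word_counts.most_common(), start=1)
    ]


def show_zipf(rows, top=10):
    """Draw the first `top` rows as a table next to a log-log rank/frequency plot."""
    # Imported here so zipf_rows() still works where matplotlib is not installed.
    import matplotlib.pyplot as plt

    fig, (ax_table, ax_plot) = plt.subplots(1, 2, figsize=(10, 4))
    ax_table.axis("off")  # hide the axes behind the table
    ax_table.table(
        cellText=[[str(value) for value in row] for row in rows[:top]],
        colLabels=["r", "word", "f", "r*f"],
        loc="center",
    )
    ax_plot.loglog(
        [rank for rank, _, _, _ in rows],
        [freq for _, _, freq, _ in rows],
        marker="o",
    )
    ax_plot.set_xlabel("rank r")
    ax_plot.set_ylabel("frequency f")
    fig.tight_layout()
    plt.show()


# Tiny made-up sample, only to show the row layout.
sample = Counter({"the": 6, "of": 3, "and": 2})
for row in zipf_rows(sample):
    print(*row, sep="\t")
```

With real data you would pass your own Counter to `zipf_rows` and then call `show_zipf(rows)`. If Zipf's law holds for the texts, the r*f column should stay roughly constant and the log-log plot should look close to a straight line.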
