
Question

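What I am trying to do is analyse the frequency of the letters in a text. As an example I will use a short sentence here, but all of this is meant for analysing large texts (so it should preferably be efficient).

Well, I have the following text: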

test = "quatre jutges dun jutjat mengen fetge dun penjat"

Then I created a function that calculates the frequencies:

def create_dictionary2(txt):
    dictionary = {}
    for x in set(txt):
        dictionary[x] = txt.count(x)/len(txt)
    return dictionary

And then:

import numpy as np
import matplotlib.pyplot as plt
test_dict = create_dictionary2(test)
plt.bar(test_dict.keys(), test_dict.values(), width=0.5, color='g')

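I get:

Issues: I would like to see all the letters, but some of them are not shown (the output is a BarContainer object of 15 artists). How can I widen the histogram? Then I would like to sort the histogram, to get something like this.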

Answer 1

For the counting, we can use a Counter object. Counter also supports getting the key-value pairs ordered from the most common value down:

from collections import Counter

import numpy as np
import matplotlib.pyplot as plt

c = Counter("quatre jutges dun jutjat mengen fetge dun penjat")
plt.bar(*zip(*c.most_common()), width=.5, color='g')

The most_common method returns a list of (key, value) tuples. The *zip(*...) idiom unpacks that list into separate arguments (see this answer).
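
To make the unpacking step more concrete, here is a small sketch using only the standard library. The variable names (pairs, letters, counts) are made up for illustration and are not part of the answer above.

```python
from collections import Counter

text = "quatre jutges dun jutjat mengen fetge dun penjat"

# most_common() returns (character, count) pairs, highest count first
pairs = Counter(text).most_common()

# zip(*pairs) transposes the list of pairs into two tuples:
# one holding all characters, one holding all counts
letters, counts = zip(*pairs)

print(pairs[:3])
print(letters[:3], counts[:3])
```

Writing plt.bar(*zip(*pairs)) is then equivalent to plt.bar(letters, counts). Note that the space character is counted as well, since nothing filters it out.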

Note: I have not updated the width or colour to match your resulting plot.
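
The question asks for relative frequencies rather than raw counts, so the Counter approach could be extended roughly as follows. This is only a sketch: letter_frequencies is a hypothetical helper name, and skipping whitespace is an assumption (the original function in the question counts spaces too).

```python
from collections import Counter

def letter_frequencies(text):
    """Return (letters, frequencies), most frequent first, ignoring whitespace."""
    # Count every character except whitespace in a single pass
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return [], []
    ordered = counts.most_common()
    letters = [ch for ch, _ in ordered]
    # Divide each count by the number of non-whitespace characters
    freqs = [n / total for _, n in ordered]
    return letters, freqs

letters, freqs = letter_frequencies(
    "quatre jutges dun jutjat mengen fetge dun penjat"
)
for ch, f in zip(letters, freqs):
    print(f"{ch}: {f:.3f}")
```

The two lists can be passed to plt.bar in the same way as above. If the bars still look cramped, asking for a wider figure with plt.figure(figsize=(12, 4)) before plotting is one option; the exact size is a matter of taste.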

Answer 2

Another solution, using pandas:

import pandas as pd
import matplotlib.pyplot as plt

test = "quatre jutges dun jutjat mengen fetge dun penjat"

# convert input to list of chars so it is easy to get into pandas 
char_list = list(test)

# create a dataframe where each char is one row
df = pd.DataFrame({'chars': char_list})
# drop all the space characters
df = df[df.chars != ' ']
# add a column for aggregation later
df['num'] = 1
# group rows by character, count the occurrences in each group
# and sort by number of occurrences
df = df.groupby('chars').sum().sort_values('num', ascending=False) / len(df)

plt.bar(df.index, df.num, width=0.5, color='g')
plt.show()

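Result:

Edit: I timed ikkuh's solution and mine.

Using Counter: 10000 loops, best of 3: 21.3 µs per loop

Using pandas groupby: 10 loops, best of 3: 22.1 ms per loop

For this small dataset, Counter is definitely much faster. Maybe I will look into this further when I have time.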
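
For anyone who wants to repeat the measurement on a larger input, a timing harness along these lines might be a starting point. It is a sketch under assumptions: it compares the str.count approach from the question with Counter (not pandas), the function names are invented for illustration, and big_text is just the sample sentence repeated. No particular outcome is claimed; the numbers depend on the machine and the text.

```python
import timeit
from collections import Counter

def freq_with_count(txt):
    # Approach from the question: one str.count call per distinct character
    return {ch: txt.count(ch) / len(txt) for ch in set(txt)}

def freq_with_counter(txt):
    # Counter-based approach: a single pass over the text
    n = len(txt)
    return {ch: c / n for ch, c in Counter(txt).items()}

sample = "quatre jutges dun jutjat mengen fetge dun penjat"
big_text = sample * 2_000  # hypothetical larger input

# Both approaches should produce the same frequencies
same_result = freq_with_count(big_text) == freq_with_counter(big_text)
print("results match:", same_result)

for func in (freq_with_count, freq_with_counter):
    # Best of 3 repeats, 5 calls per repeat
    best = min(timeit.repeat(lambda: func(big_text), number=5, repeat=3))
    print(f"{func.__name__}: {best / 5 * 1e3:.2f} ms per call")
```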

Letter frequency: plot a histogram ordered by value (Python)
