Question

I have the following DataFrame df (a small excerpt is shown):

time_diff   avg_qty_per_day
1.450000    1.0
1.483333    1.0
1.500000    1.0
2.516667    1.0
2.533333    1.0
2.533333    1.5
3.633333    1.8
3.644567    5.0

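I want to create a histogram of the variable time_diff, so that I can see how its values are distributed and which values occur most often.

I use this code to do it: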

import locale

import matplotlib.pyplot as plt
import matplotlib.ticker
import numpy as np

bins = np.arange(df['time_diff'].min(), df['time_diff'].max()+1, 1)
hist, edges = np.histogram(df['time_diff'], bins=bins)

norm = plt.Normalize(hist.min(), hist.max())
colors = plt.cm.YlGnBu(norm(hist))

fig, ax = plt.subplots(figsize=(14,8))
ax.bar(edges[:-1], hist, np.diff(edges), color=colors, ec="k")
plt.xticks(rotation='vertical', fontsize=11)
plt.xticks(np.arange(min(df['time_diff']), max(df['time_diff'])+1, 1.0))
# locale.format was removed in Python 3.12; locale.format_string replaces it
ax.get_xaxis().set_major_formatter(
    matplotlib.ticker.FuncFormatter(lambda x, p: locale.format_string('%d', x, grouping=True)))
plt.show()

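The problem is that this code does not take the values of avg_qty_per_day into account when computing the frequency of time_diff (the Y axis). It simply counts each row once. However, I need to use avg_qty_per_day as the number of occurrences. How can I solve this?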

Update

For example, if I do this:

a = df[df.time_diff>3]
np.sum(a.avg_qty_per_day.values)

...then I get 6.8. This does not match the Y axis of my plot, where the corresponding bin has a height of 2.

Answer 1

I think you first need to group by "time_diff" to get "avg_qty_per_day" per group. Here is some dummy code:

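import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Dummy data: 100 rows of random integers in [0, 10) in columns A and B
df = pd.DataFrame(np.random.randint(0,10,size=(100, 2)), columns=list('AB'))
print(df.head())
# Sum B for each distinct value of A (the result is not stored or plotted here)
df.groupby("A").sum()
# Plain, unweighted histogram of column A
plt.hist(df.A)
plt.show()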
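
As a possible alternative to the dummy code above, here is a minimal sketch that applies the grouping idea to the sample rows from the question. It assumes that avg_qty_per_day should act as a per-row weight, and it uses the `weights` argument of `np.histogram` for that. The `pd.cut` plus `groupby` step is only a cross-check of the same sums; the variable names `counts`, `weighted` and `grouped` are illustrative and do not come from the original post.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "time_diff": [1.450000, 1.483333, 1.500000, 2.516667,
                  2.533333, 2.533333, 3.633333, 3.644567],
    "avg_qty_per_day": [1.0, 1.0, 1.0, 1.0, 1.0, 1.5, 1.8, 5.0],
})

# Same bin edges as in the question: width 1, starting at the minimum
bins = np.arange(df["time_diff"].min(), df["time_diff"].max() + 1, 1)

# Unweighted: every row counts once (what the question's plot shows)
counts, edges = np.histogram(df["time_diff"], bins=bins)

# Weighted: every row contributes its avg_qty_per_day to its bin
weighted, _ = np.histogram(df["time_diff"], bins=bins,
                           weights=df["avg_qty_per_day"])

# Cross-check with pandas: assign each row to a bin, then sum the weights
bin_labels = pd.cut(df["time_diff"], bins=bins, right=False)
grouped = df.groupby(bin_labels, observed=False)["avg_qty_per_day"].sum()

print(counts)    # row counts per bin
print(weighted)  # summed avg_qty_per_day per bin
print(grouped)
```

On these rows the last bin holds 2 rows but a weighted total of 6.8, which is the mismatch described in the update. To plot the weighted version, `weighted` could be passed to `ax.bar` in place of `hist` in the question's code.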
