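I have two or three CSV files with identical headers and would like to plot the histograms of each column overlaid on one another in the same figure.
The following code gives me two separate figures, each containing all the histograms for one file. Is there a compact way to plot them together on the same figure with pandas/matplotlib? I imagine something close to this, but using DataFrames.
Code: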
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('input1.csv')
df2 = pd.read_csv('input2.csv')
df.hist(bins=20)
df2.hist(bins=20)
plt.show()
Answer 0 (score: 5)
In [17]: import matplotlib.pyplot as plt
In [18]: from pandas import DataFrame
In [19]: from numpy.random import randn
In [20]: df = DataFrame(randn(10, 2))
In [21]: df2 = DataFrame(randn(10, 2))
In [22]: axs = df.hist()
In [23]: # DataFrame.iteritems() was removed in pandas 2.0; items() replaces it
    ...: for ax, (colname, values) in zip(axs.flat, df2.items()):
    ...:     values.hist(ax=ax, bins=10)
    ...:
In [24]: plt.show()
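This gives a single figure with one subplot per column, in which the histogram of each df2 column is drawn over the histogram of the corresponding df column.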
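Since the question mentions two or three files, the same idea can be wrapped in a small helper. The following is only a sketch: overlay_hists is a hypothetical name, the random DataFrames stand in for the CSV files, and axes are matched to columns through the subplot titles that DataFrame.hist() sets.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def overlay_hists(frames, labels, bins=20, alpha=0.5):
    """Overlay histograms of same-named columns from several DataFrames.

    Hypothetical helper: one subplot per numeric column of the first frame.
    """
    # Let pandas lay out one subplot per column for the first frame.
    axs = frames[0].hist(bins=bins, alpha=alpha, label=labels[0])
    # DataFrame.hist() titles each subplot with its column name;
    # unused grid cells have an empty title and are skipped.
    axes_by_col = {ax.get_title(): ax for ax in axs.flat if ax.get_title()}
    # Draw the remaining frames onto the matching axes.
    for frame, label in zip(frames[1:], labels[1:]):
        for col in frame.columns:
            ax = axes_by_col.get(str(col))
            if ax is not None:
                frame[col].hist(ax=ax, bins=bins, alpha=alpha, label=label)
    for ax in axes_by_col.values():
        ax.legend()
    return axs


# Stand-in data; in practice these would come from pd.read_csv(...).
rng = np.random.default_rng(0)
df_a = pd.DataFrame(rng.normal(size=(100, 2)), columns=["x", "y"])
df_b = pd.DataFrame(rng.normal(loc=1.0, size=(100, 2)), columns=["x", "y"])
df_c = pd.DataFrame(rng.normal(loc=2.0, size=(100, 2)), columns=["x", "y"])

axs = overlay_hists([df_a, df_b, df_c], ["input1", "input2", "input3"])
plt.savefig("overlaid_hists.png")
```

Note that passing an integer for bins lets each DataFrame get its own bin edges, so the bars are not aligned across files; passing an explicit array of edges instead keeps them aligned.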
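Answer 1 (score: 1)
Phillip Cloud's answer already solves the main problem: overlaying the histograms of two (or more) DataFrames that contain the same variables, in side-by-side subplots within a single figure.
This answer addresses the follow-up raised by the question's author (in a comment on the accepted answer) about how to enforce the same number of bins and the same range for the variables common to both DataFrames. This can be done by creating a list of bin edges shared by all variables of both DataFrames. In fact, this answer goes a little further and adjusts the plots for the case where the different variables contained in each DataFrame cover slightly different ranges (but still within the same order of magnitude), as in the following example: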
import numpy as np # v 1.19.2
import pandas as pd # v 1.1.3
import matplotlib.pyplot as plt # v 3.3.2
from matplotlib.lines import Line2D
# Set seed for random data
rng = np.random.default_rng(seed=1)
# Create two similar dataframes each containing two random variables,
# with df2 twice the size of df1
df1_size = 1000
df1 = pd.DataFrame(dict(var1 = rng.exponential(scale=1.0, size=df1_size),
                        var2 = rng.normal(loc=40, scale=5, size=df1_size)))
df2_size = 2*df1_size
df2 = pd.DataFrame(dict(var1 = rng.exponential(scale=2.0, size=df2_size),
                        var2 = rng.normal(loc=50, scale=10, size=df2_size)))
# Combine the dataframes to extract the min/max values of each variable
df_combined = pd.concat([df1, df2])
vars_min = [df_combined[var].min() for var in df_combined]
vars_max = [df_combined[var].max() for var in df_combined]
# Create custom bins based on the min/max of all values from both
# dataframes to ensure that in each histogram the bins are aligned
# making them easily comparable
nbins = 30
bin_edges, step = np.linspace(min(vars_min), max(vars_max), nbins+1, retstep=True)
# Create figure by combining the outputs of two pandas df.hist() function
# calls using the 'step' type of histogram to improve plot readability
htype = 'step'
alpha = 0.7
lw = 2
axs = df1.hist(figsize=(10,4), bins=bin_edges, histtype=htype,
               linewidth=lw, alpha=alpha, label='df1')
df2.hist(ax=axs.flatten(), grid=False, bins=bin_edges, histtype=htype,
         linewidth=lw, alpha=alpha, label='df2')
# Adjust x-axes limits based on min/max values and step between bins, and
# remove top/right spines: if, contrary to this example dataset, var1 and
# var2 cover the same range, setting the x-axes limits with this loop is
# not necessary
for ax, v_min, v_max in zip(axs.flatten(), vars_min, vars_max):
    ax.set_xlim(v_min-2*step, v_max+2*step)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
# Edit legend to get lines as legend keys instead of the default polygons:
# use legend handles and labels from any of the axes in the axs object
# (here taken from first one) seeing as the legend box is by default only
# shown in the last subplot when using the plt.legend() function.
handles, labels = axs.flatten()[0].get_legend_handles_labels()
lines = [Line2D([0], [0], lw=lw, color=h.get_facecolor()[:-1], alpha=alpha)
         for h in handles]
plt.legend(lines, labels, frameon=False)
plt.suptitle('Pandas', x=0.5, y=1.1, fontsize=14)
plt.show()
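It is worth noting that the seaborn package provides a more convenient way to create this kind of plot: unlike pandas, it aligns the bins automatically. The only downside is that the DataFrames must first be combined and reshaped to long format, as shown in this example, which uses the same DataFrames and bins as before: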
import seaborn as sns # v 0.11.0
# Combine dataframes and convert the combined dataframe to long format
df_concat = pd.concat([df1, df2], keys=['df1','df2']).reset_index(level=0)
df_melt = df_concat.melt(id_vars='level_0', var_name='var_id')
# Create figure using seaborn displot: note that the bins are automatically
# aligned thanks to the 'common_bins' parameter of the seaborn histplot function
# (called here with kind='hist'), which is set to True by default. Here, the
# bins from the previous example are used to make the figures more comparable.
# Also note that the facets share the same x and y axes by default; this can
# be changed when var1 and var2 have different ranges and different
# distribution shapes, as is the case in this example.
g = sns.displot(df_melt, kind='hist', x='value', col='var_id', hue='level_0',
                element='step', bins=bin_edges, fill=False, height=4,
                facet_kws=dict(sharex=False, sharey=False))
# For some reason setting sharex as above does not automatically adjust the
# x-axes limits (even when not setting a bins argument, maybe due to a bug
# with this package version), which is why this is done in the following loop.
# Note that you still need to set 'sharex=False' in displot, or else
# 'ax.set_xlim' will have no effect.
for ax, v_min, v_max in zip(g.axes.flatten(), vars_min, vars_max):
    ax.set_xlim(v_min-2*step, v_max+2*step)
# Additional formatting
g.legend.set_bbox_to_anchor((.9, 0.75))
g.legend.set_title('')
plt.suptitle('Seaborn', x=0.5, y=1.1, fontsize=14)
plt.show()
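To make the long-format requirement more concrete, here is a minimal sketch of the same concat-then-melt reshaping on two tiny, made-up DataFrames. The names df_a, df_b, 'source' and 'row' are illustrative only and do not come from the example above.

```python
import pandas as pd

# Two tiny stand-in DataFrames with identical columns.
df_a = pd.DataFrame({"var1": [1.0, 2.0], "var2": [10.0, 20.0]})
df_b = pd.DataFrame({"var1": [3.0], "var2": [30.0]})

# Stack the frames; 'keys' records which frame each row came from
# in the outer index level, which is then turned into a column.
combined = pd.concat([df_a, df_b], keys=["df_a", "df_b"],
                     names=["source", "row"])
combined = combined.reset_index(level="source")

# Long format: one row per (source, variable, value) observation.
long_df = combined.melt(id_vars="source", var_name="var_id",
                        value_name="value")
print(long_df)
```

Each original cell becomes one row, so the 3 rows by 2 variables here turn into 6 rows, with 'source' playing the role that 'level_0' plays as the hue variable in the seaborn call.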
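In the seaborn figure, you may notice that the histogram lines are cut off at the limits of the list of bin edges (not visible on the maximum side because of the scale). To get lines more similar to those of the pandas example, an empty bin can be added at each end of the bin list, like this: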
bin_edges = np.insert(bin_edges, 0, bin_edges.min()-step)
bin_edges = np.append(bin_edges, bin_edges.max()+step)
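The effect of this padding can be checked in isolation with NumPy. This is a small sketch on made-up numbers (the edges 0 to 3 and the four data points are assumptions for illustration): the two added bins lie outside the data range and therefore stay empty, which is what lets a step histogram drop back to zero at both ends.

```python
import numpy as np

# Four equally spaced edges (three bins) with a step of 1.0.
bin_edges, step = np.linspace(0.0, 3.0, 4, retstep=True)

# Add one extra bin on each side, exactly one step wide.
padded = np.insert(bin_edges, 0, bin_edges.min() - step)
padded = np.append(padded, padded.max() + step)

# All data points fall inside the original range [0, 3].
data = np.array([0.5, 1.5, 1.6, 2.5])
counts, _ = np.histogram(data, bins=padded)

print(padded)  # [-1.  0.  1.  2.  3.  4.]
print(counts)  # [0 1 2 1 0]
```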
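The var1/var2 example also illustrates the limits of this approach of setting common bins for both facets. Because the ranges of var1 and var2 are somewhat different and 30 bins are used to cover the combined range, the histogram for var1 contains quite few bins while the histogram for var2 has slightly more bins than necessary. To my knowledge, there is no straightforward way of assigning a different list of bins to each facet when calling the plotting functions df.hist() and displot(df). So for cases where the variables cover significantly different ranges, these figures have to be created from scratch using matplotlib or another plotting library.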
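One possible way to build such a figure from scratch is sketched below with plain matplotlib. It is not part of the original answer: the data is regenerated with the same distributions as above, and the bin edges are computed per variable from the pooled values of both DataFrames with np.histogram_bin_edges, so each subplot gets its own 30 bins that are still aligned across df1 and df2.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Same kind of data as in the example above.
rng = np.random.default_rng(seed=1)
df1 = pd.DataFrame({"var1": rng.exponential(scale=1.0, size=1000),
                    "var2": rng.normal(loc=40, scale=5, size=1000)})
df2 = pd.DataFrame({"var1": rng.exponential(scale=2.0, size=2000),
                    "var2": rng.normal(loc=50, scale=10, size=2000)})
frames = {"df1": df1, "df2": df2}
nbins = 30

fig, axs = plt.subplots(1, len(df1.columns), figsize=(10, 4))
edges_by_var = {}
for ax, var in zip(np.atleast_1d(axs), df1.columns):
    # Pool the values of this variable from all frames so that the
    # edges span the combined range of this variable only.
    pooled = np.concatenate([f[var].to_numpy() for f in frames.values()])
    edges = np.histogram_bin_edges(pooled, bins=nbins)
    edges_by_var[var] = edges
    # Every frame uses the same edges within a subplot, so bins align.
    for name, frame in frames.items():
        ax.hist(frame[var], bins=edges, histtype="step",
                linewidth=2, alpha=0.7, label=name)
    ax.set_title(var)
    ax.legend(frameon=False)

fig.tight_layout()
fig.savefig("per_variable_bins.png")
```

Compared with the shared bin list, var1 and var2 each use all 30 bins over their own range, at the cost of the two subplots no longer having comparable bin widths.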