Question

I am reading a CSV file with pandas and making simple histograms as follows:

import sys
import pandas as pd

df = pd.read_csv(sys.argv[1], header=0)
hFare = df['Fare'].dropna().hist(bins=[0,10,20,30,45,60,75,100,600], label="All")
hSurFare = df[df.Survived==1]['Fare'].dropna().hist(bins=[0,10,20,30,45,60,75,100,600], label="Survivors")

What I want is the bin-by-bin ratio of the two histograms. Is there a simple way to do this?

Answer 1

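First, we'll create some sample data. In the future, if you ask a question about pandas, it's best to include sample data that people can easily copy and paste into their Python console: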

import pandas as pd
import numpy as np
df = pd.DataFrame({'Fare': np.random.uniform(0, 600, 400), 
                   'Survived': np.random.randint(0, 2, 400)})

Then use pd.cut to bin the data the same way you did in the histogram:

df['fare_bin'] = pd.cut(df['Fare'], bins=[0,10,20,30,45,60,75,100,600])

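Look at the total count and the number of survivors within each bin (you would probably do this as separate columns, but I'm just doing it quickly here):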

df.groupby('fare_bin').apply(lambda g: (g.shape[0], g.loc[g['Survived'] == 1, :].shape[0]))

Out[34]: 
fare_bin
(0, 10]           (7, 4)
(10, 20]          (9, 6)
(100, 600]    (326, 156)
(20, 30]          (5, 4)
(30, 45]         (12, 6)
(45, 60]        (15, 11)
(60, 75]         (13, 7)
(75, 100]        (13, 6)
dtype: object
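
The separate-columns variant mentioned above could look roughly like the following. This is only a sketch, not part of the original answer: the column names `total` and `survived` are made up for illustration, it assumes `Survived` holds only 0 and 1 (so its sum is the survivor count), and it assumes a pandas version recent enough to support named aggregation and the `observed` keyword.

```python
import numpy as np
import pandas as pd

# Sample data in the same shape as above (seeded so it is reproducible)
rng = np.random.default_rng(0)
df = pd.DataFrame({'Fare': rng.uniform(0, 600, 400),
                   'Survived': rng.integers(0, 2, 400)})

bins = [0, 10, 20, 30, 45, 60, 75, 100, 600]
df['fare_bin'] = pd.cut(df['Fare'], bins=bins)

# One row per bin, with the total count and the survivor count as columns.
# 'total' and 'survived' are illustrative names, not from the original answer.
counts = (df.groupby('fare_bin', observed=False)['Survived']
            .agg(total='size', survived='sum'))
print(counts)
```

Passing `observed=False` keeps bins that happen to contain no rows, so the result always has one row per bin.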

Then write a quick function to get the ratio (total count divided by survivor count in each bin):

def get_ratio(g):
    try:
        return float(g.shape[0]) / g.loc[g['Survived'] == 1, :].shape[0]
    except ZeroDivisionError:
        return np.nan
df.groupby('fare_bin').apply(get_ratio)

Out[30]: 
fare_bin
(0, 10]       1.750000
(10, 20]      1.500000
(100, 600]    2.089744
(20, 30]      1.250000
(30, 45]      2.000000
(45, 60]      1.363636
(60, 75]      1.857143
(75, 100]     2.166667
dtype: float64
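
If you would rather divide the two histograms directly, one possible alternative (not from the original answer) is to compute both sets of bin counts with np.histogram and divide the arrays. Note two assumptions in this sketch: it computes survivors divided by all passengers, which is the inverse of get_ratio above, and np.histogram uses left-closed bins [a, b) (with the last bin closed on both ends), whereas pd.cut defaults to right-closed bins (a, b], so values lying exactly on a bin edge can land in different bins.

```python
import numpy as np
import pandas as pd

# Sample data in the same shape as above (seeded so it is reproducible)
rng = np.random.default_rng(0)
df = pd.DataFrame({'Fare': rng.uniform(0, 600, 400),
                   'Survived': rng.integers(0, 2, 400)})

bins = [0, 10, 20, 30, 45, 60, 75, 100, 600]

# Counts per bin for all passengers and for survivors only
all_counts, _ = np.histogram(df['Fare'].dropna(), bins=bins)
surv_counts, _ = np.histogram(df.loc[df['Survived'] == 1, 'Fare'].dropna(), bins=bins)

# Bin-by-bin ratio (survivors / all); bins with no passengers stay NaN
ratio = np.divide(surv_counts, all_counts,
                  out=np.full(len(all_counts), np.nan),
                  where=all_counts > 0)
print(ratio)
```

Using the `where` and `out` arguments avoids a divide-by-zero warning for empty bins and plays the same role as the ZeroDivisionError handling in get_ratio.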

Dividing histograms in pandas
