I am reading a CSV file with pandas and making simple histograms as follows:
import sys
import pandas as pd

df = pd.read_csv(sys.argv[1], header=0)
hFare = df['Fare'].dropna().hist(bins=[0,10,20,30,45,60,75,100,600], label="All")
hSurFare = df[df.Survived==1]['Fare'].dropna().hist(bins=[0,10,20,30,45,60,75,100,600], label="Survivors")
What I want is the bin-by-bin ratio of the two histograms. Is there a simple way to do this?
Answer 0 (score: 3)
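First, let's create some sample data. In the future, when you ask a question about pandas, it is best to include sample data that people can easily copy and paste into their Python console: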
import pandas as pd
import numpy as np
df = pd.DataFrame({'Fare': np.random.uniform(0, 600, 400),
                   'Survived': np.random.randint(0, 2, 400)})
Then use pd.cut to bin the data the same way you did in the histograms:
df['fare_bin'] = pd.cut(df['Fare'], bins=[0,10,20,30,45,60,75,100,600])
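One detail worth checking is how pd.cut treats the bin edges. A small sketch with made-up fares (the values below are illustrative only, not from the question's data) shows that the default intervals are open on the left and closed on the right, so a fare of exactly 0 falls outside the first bin:

```python
import pandas as pd

# Hypothetical fares chosen to sit on or near the bin edges.
fares = pd.Series([0.0, 5.0, 10.0, 10.5, 600.0])
binned = pd.cut(fares, bins=[0, 10, 20, 30, 45, 60, 75, 100, 600])

# By default each bin is (left, right], so 0.0 gets no bin (NaN)
# while 10.0 still belongs to (0, 10].
print(binned)
```

If zero fares should be counted, passing include_lowest=True to pd.cut makes the first interval include its left edge.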
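Look at the total count and the number of survivors within each bin (you would probably do this as separate columns, but I'm just doing it quickly here):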
df.groupby('fare_bin').apply(lambda g: (g.shape[0], g.loc[g['Survived'] == 1, :].shape[0]))
Out[34]:
fare_bin
(0, 10] (7, 4)
(10, 20] (9, 6)
(100, 600] (326, 156)
(20, 30] (5, 4)
(30, 45] (12, 6)
(45, 60] (15, 11)
(60, 75] (13, 7)
(75, 100] (13, 6)
dtype: object
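The "separate columns" variant mentioned above could look roughly like this. It is only a sketch: the column names total and survivors are my own choice, the seed is there purely for reproducibility, and it assumes a pandas version that supports named aggregation (0.25 or later):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed only so the sketch is repeatable
df = pd.DataFrame({'Fare': rng.uniform(0, 600, 400),
                   'Survived': rng.integers(0, 2, 400)})
bins = [0, 10, 20, 30, 45, 60, 75, 100, 600]
df['fare_bin'] = pd.cut(df['Fare'], bins=bins)

# Survived is 0/1, so 'size' counts every row in the bin
# and 'sum' counts only the survivors.
counts = (df.groupby('fare_bin', observed=False)['Survived']
            .agg(total='size', survivors='sum'))
print(counts)
```

Having the two counts as columns makes the ratio a plain column division, e.g. counts['survivors'] / counts['total'].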
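Then write a quick function to get the ratio (here the total count divided by the survivor count; invert it if you want the surviving fraction instead):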
def get_ratio(g):
    try:
        return float(g.shape[0]) / g.loc[g['Survived'] == 1, :].shape[0]
    except ZeroDivisionError:
        return np.nan
df.groupby('fare_bin').apply(get_ratio)
Out[30]:
fare_bin
(0, 10] 1.750000
(10, 20] 1.500000
(100, 600] 2.089744
(20, 30] 1.250000
(30, 45] 2.000000
(45, 60] 1.363636
(60, 75] 1.857143
(75, 100] 2.166667
dtype: float64
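If you would rather stay close to the histograms themselves, a possible alternative is to compute the counts with np.histogram and divide the arrays. This is a sketch under the same sample-data assumptions as above (random data, fixed seed); it computes survivors divided by all passengers per bin. Note that np.histogram uses [left, right) bins with the last bin closed on both sides, which differs slightly at the edges from pd.cut's default (left, right]:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed only so the sketch is repeatable
df = pd.DataFrame({'Fare': rng.uniform(0, 600, 400),
                   'Survived': rng.integers(0, 2, 400)})
bins = [0, 10, 20, 30, 45, 60, 75, 100, 600]

fare = df['Fare'].dropna()
surv_fare = df.loc[df['Survived'] == 1, 'Fare'].dropna()

# Same bin edges for both, so the count arrays line up bin by bin.
all_counts, edges = np.histogram(fare, bins=bins)
surv_counts, _ = np.histogram(surv_fare, bins=bins)

# Survivors / all per bin; bins with no passengers stay NaN
# instead of triggering a division-by-zero warning.
ratio = np.divide(surv_counts, all_counts,
                  out=np.full(len(all_counts), np.nan),
                  where=all_counts > 0)
print(pd.Series(ratio, index=pd.IntervalIndex.from_breaks(edges, closed='left')))
```

The resulting array can be plotted directly, for example as a bar chart over the bin edges.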