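I have a pandas.DataFrame with several columns; some columns hold continuous data and some hold categories. I have been trying to group by category and then, within each category, split the rows into arrays according to a condition (i.e. values lying between two numbers).
Here is a brute-force hack job I wrote that does the work, but I would like to know whether there is a more elegant way.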
import pandas as pd
df = pd.DataFrame({'Category1' : [ 0.3, 3.0, 12.4, 7.4,
20.3, 15.0, 10.9, 17.4],
'Category2' : [ 0, 0, 1, 0,
1, 1, 0, 0],
'Category3' : [ 1, 2, 3, 4,
5, 6, 7, 8],
'Category4' : ['foo','bar','fizz','buzz',
'spam','nii','blah','lol'],
# etc. (further columns omitted)
})
group_0_5 = df['Category1']<=5.0
group_5_10 = (df['Category1']>5.0) & (df['Category1']<=10.0)
group_10_15 = (df['Category1']>10.0) & (df['Category1']<=15.0)
group_15_20 = (df['Category1']>15.0) & (df['Category1']<=20.0)
group_20_25 = (df['Category1']>20.0) & (df['Category1']<=25.0)
state1 = (df['Category2']==1)
state2 = (df['Category2']==0)
count1_state1 = df.loc[group_0_5 & state1]['Category3'].count()
count2_state1 = df.loc[group_5_10 & state1]['Category3'].count()
count3_state1 = df.loc[group_10_15 & state1]['Category3'].count()
count4_state1 = df.loc[group_15_20 & state1]['Category3'].count()
count5_state1 = df.loc[group_20_25 & state1]['Category3'].count()
count1_state2 = df.loc[group_0_5 & state2]['Category3'].count()
count2_state2 = df.loc[group_5_10 & state2]['Category3'].count()
count3_state2 = df.loc[group_10_15 & state2]['Category3'].count()
count4_state2 = df.loc[group_15_20 & state2]['Category3'].count()
count5_state2 = df.loc[group_20_25 & state2]['Category3'].count()
count_array1=[count1_state1, count2_state1, count3_state1, count4_state1, count5_state1]
count_array2=[count1_state2, count2_state2, count3_state2, count4_state2, count5_state2]
print (count_array1)
print (count_array2)
Out [2]:
[nan, nan, 2, 1, 1]
[ 2, 1, 1, 1, nan]
Answer 0 (score: 3)
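I think you need cut to bin Category1, then groupby on the bins together with Category2, aggregate with count, and add the missing combinations with reindex before unstack: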
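import numpy as np

# bin edges; each interval is closed on the right, e.g. (5, 10]
bins = [-np.inf, 5, 10, 15, 20, 25, np.inf]
bins = pd.cut(df['Category1'], bins=bins)
# every (observed bin, Category2) combination, so missing pairs appear after reindex
mux = pd.MultiIndex.from_product([bins.unique(), df['Category2'].unique()])
a = df.groupby([bins, df['Category2']])['Category3'].count().reindex(mux).unstack(0)
print (a)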
   (-inf, 5]  (5, 10]  (10, 15]  (15, 20]  (20, 25]
0        2.0      1.0       1.0       1.0       NaN
1        NaN      NaN       2.0       NaN       1.0
#select by categories of column Category2
print (a.loc[0].values)
[ 2. 1. 1. 1. nan]
print (a.loc[1].values)
[ nan nan 2. nan 1.]
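To make the binning step concrete, here is a minimal sketch of what pd.cut does to the Category1 values from the question. The finite bin edges and the string labels are assumptions chosen for readability; the answer itself uses infinite outer edges and the default interval labels.

```python
import pandas as pd

category1 = pd.Series([0.3, 3.0, 12.4, 7.4, 20.3, 15.0, 10.9, 17.4])

# Assumed edges and labels, for illustration only.
edges = [0, 5, 10, 15, 20, 25]
labels = ['0-5', '5-10', '10-15', '15-20', '20-25']

# right=True (the default) makes each bin closed on the right,
# so 15.0 lands in '10-15' rather than '15-20'.
binned = pd.cut(category1, bins=edges, labels=labels)

print(binned.tolist())
```

The result is a categorical Series aligned with the original rows, which is why it can be passed straight to groupby as a grouping key.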
If you need to replace NaN with 0, add the parameter fill_value=0 to reindex:
mux = pd.MultiIndex.from_product([bins.unique(), df['Category2'].unique()])
a = (df.groupby([bins, df['Category2']])['Category3'].count()
       .reindex(mux, fill_value=0)
       .unstack(0))
print (a)
   (-inf, 5]  (5, 10]  (10, 15]  (15, 20]  (20, 25]
0          2        1         1         1         0
1          0        0         2         0         1
print (a.loc[0].values)
[2 1 1 1 0]
print (a.loc[1].values)
[0 0 2 0 1]
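The difference between the two outputs comes from how reindex fills the combinations that were never observed. A small sketch on a toy Series shows the effect in isolation; the 'low'/'high' labels and the counts are made up for illustration and are not taken from the question's data.

```python
import pandas as pd

# Toy counts: only two of the four possible (bin, state) pairs were observed.
counts = pd.Series(
    [2, 1],
    index=pd.MultiIndex.from_tuples([('low', 0), ('high', 1)]),
)

# Full grid of combinations, built with MultiIndex.from_product like mux.
full = pd.MultiIndex.from_product([['low', 'high'], [0, 1]])

with_nan = counts.reindex(full)                 # missing pairs become NaN, dtype becomes float
with_zero = counts.reindex(full, fill_value=0)  # missing pairs become 0, dtype stays integer

print(with_nan.tolist())
print(with_zero.tolist())
```

This is also why the first result prints 2.0 and 1.0 while the second prints plain integers: NaN forces a float dtype.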
Also see the documentation for reindex.
Answer 1 (score: 2)
Using pandas.cut() and pandas.DataFrame.groupby, you can collect the elements as needed:
Code:
groups = df.groupby(pd.cut(df['Category1'], [0, 5, 10, 15, 20, 25]))
group_size = groups['Category2'].count().values
group_ones = groups['Category2'].sum().values
print(list(group_ones))
print(list(group_size - group_ones))
Results:
[0, 0, 2, 0, 1]
[2, 1, 1, 1, 0]
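The sum trick works because Category2 only holds 0 and 1, so the sum equals the number of ones. As a hedged alternative that does not depend on that encoding, the same table can be sketched with pd.crosstab; this is a swapped-in technique, not the approach used in this answer, and the trimmed DataFrame below repeats only the two columns it needs.

```python
import pandas as pd

df = pd.DataFrame({
    'Category1': [0.3, 3.0, 12.4, 7.4, 20.3, 15.0, 10.9, 17.4],
    'Category2': [0, 0, 1, 0, 1, 1, 0, 0],
})

binned = pd.cut(df['Category1'], [0, 5, 10, 15, 20, 25])

# pd.crosstab counts rows per (Category2, bin) pair; pairs with no rows get 0.
table = pd.crosstab(df['Category2'], binned)

print(table.loc[1].tolist())
print(table.loc[0].tolist())
```

Each row of the table is one Category2 value, so additional category labels would simply appear as extra rows.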
Answer 2 (score: 0)
Again pd.cut, this time with groupby and set_index:
df = df.groupby([pd.cut(df['Category1'], bins=bins, right = True), 'Category2']).Category3.count().reset_index()
df = df.set_index(['Category1', 'Category2']).unstack().reset_index(-1,drop=True)
count_array1 = df.loc[:, ('Category3', 1)].tolist()
print(count_array1)
[nan, nan, 2.0, nan, 1.0]
count_array2 = df.loc[:, ('Category3', 0)].tolist()
print(count_array2)
[2.0, 1.0, 1.0, 1.0, nan]