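Python: Binning based on 2 columns in Pandas

Time: 2017-09-28 15:20:20

Tags: python pandas pandas-groupby binning

I am looking for a quick and elegant way to bin records based on 2 columns in Pandas.

Here is my data frame: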

                              filename  height   width
0        shopfronts_23092017_3_285.jpg   750.0   560.0
1                   shopfronts_200.jpg  4395.0  6020.0
2  shopfronts_25092017_eateries_98.jpg   414.0   621.0
3                   shopfronts_101.jpg   480.0   640.0
4                   shopfronts_138.jpg  3733.0  8498.0
5  shopfronts_25092017_eateries_95.jpg   187.0   250.0
6      shopfronts_25092017_neon_33.jpg   100.0   200.0
7                   shopfronts_322.jpg   682.0  1024.0
8                   shopfronts_171.jpg   800.0   600.0
9         shopfronts_23092017_3_35.jpg   120.0   210.0

I need to bin the records based on 2 columns, height and width (the image resolution).

I am looking for something like this:

                              filename  height   width   group
0        shopfronts_23092017_3_285.jpg   750.0   560.0      g3
1                   shopfronts_200.jpg  4395.0  6020.0      g4
2  shopfronts_25092017_eateries_98.jpg   414.0   621.0  others
3                   shopfronts_101.jpg   480.0   640.0  others
4                   shopfronts_138.jpg  3733.0  8498.0      g4
5  shopfronts_25092017_eateries_95.jpg   187.0   250.0      g1
6      shopfronts_25092017_neon_33.jpg   100.0   200.0      g1
7                   shopfronts_322.jpg   682.0  1024.0  others
8                   shopfronts_171.jpg   800.0   600.0      g3
9         shopfronts_23092017_3_35.jpg   120.0   210.0      g1

where

g1: <= 400x300
g2: (400x300, 640x480]
g3: (640x480, 800x600]
g4: > 800x600
others: records that don't comply with the requirements above (e.g. records 7, 2, 3, where height and width do not both fall into the same category)

I would like to get frequency counts using the group column. If this is not the best approach and there is a better way, please let me know.

2 Answers:

Answer 0: (score: 3)

You can use pd.cut twice, once per column:

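import numpy as np
import pandas as pd

# Bin height into g1..g4 (pd.cut intervals are right-closed by default)
bins = [0, 400, 640, 800, np.inf]
df['group'] = pd.cut(df['height'].values, bins, labels=["g1", "g2", "g3", "g4"])

# Bin width with the matching thresholds
nbin = [0, 300, 480, 600, np.inf]
t = pd.cut(df['width'].values, nbin, labels=["g1", "g2", "g3", "g4"])

# Keep the label where both bins agree, otherwise mark the row as 'others'
df['group'] = np.where(df['group'] == t, df['group'], 'others')

                              filename  height   width  group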
0        shopfronts_23092017_3_285.jpg   750.0   560.0      g3
1                   shopfronts_200.jpg  4395.0  6020.0      g4
2  shopfronts_25092017_eateries_98.jpg   414.0   621.0  others
3                   shopfronts_101.jpg   480.0   640.0  others
4                   shopfronts_138.jpg  3733.0  8498.0      g4
5  shopfronts_25092017_eateries_95.jpg   187.0   250.0      g1
6      shopfronts_25092017_neon_33.jpg   100.0   200.0      g1
7                   shopfronts_322.jpg   682.0  1024.0  others
8                   shopfronts_171.jpg   800.0   600.0      g3
9         shopfronts_23092017_3_35.jpg   120.0   210.0      g1
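
The question also asks for frequency counts, which this answer does not show. A minimal self-contained sketch follows, assuming the sample data from the question (filenames omitted) and using `value_counts` on the resulting column. The variable names `h_group`, `w_group` and `counts` are illustrative, and the bins are cast to plain objects before comparing, which is a choice made here rather than part of the answer.

```python
import numpy as np
import pandas as pd

# Sample heights and widths copied from the question (filenames omitted).
df = pd.DataFrame({
    "height": [750.0, 4395.0, 414.0, 480.0, 3733.0, 187.0, 100.0, 682.0, 800.0, 120.0],
    "width": [560.0, 6020.0, 621.0, 640.0, 8498.0, 250.0, 200.0, 1024.0, 600.0, 210.0],
})

labels = ["g1", "g2", "g3", "g4"]

# Bin each column separately; pd.cut uses right-closed intervals by default.
h_group = pd.cut(df["height"], [0, 400, 640, 800, np.inf], labels=labels).astype(object)
w_group = pd.cut(df["width"], [0, 300, 480, 600, np.inf], labels=labels).astype(object)

# Keep the label only where both columns land in the same bin.
df["group"] = np.where(h_group == w_group, h_group, "others")

# Frequency count per group label.
counts = df["group"].value_counts()
print(counts)
```

Groups with no matching rows (g2 in this sample) do not appear in the counts.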

Answer 1: (score: 2)

Use nested np.where calls:

In [4510]: df['group'] = np.where((df.height <= 400) & (df.width <= 300),
      ...:          'g1',
      ...:          np.where((df.height <= 640) & (df.width <= 480),
      ...:          'g2',
      ...:          np.where((df.height <= 800) & (df.width <= 600),
      ...:          'g3',
      ...:          np.where((df.height > 800) & (df.width > 600),
      ...:          'g4',
      ...:          'others'))))

In [4511]: df
Out[4511]:
                              filename  height   width   group
0        shopfronts_23092017_3_285.jpg   750.0   560.0      g3
1                   shopfronts_200.jpg  4395.0  6020.0      g4
2  shopfronts_25092017_eateries_98.jpg   414.0   621.0  others
3                   shopfronts_101.jpg   480.0   640.0  others
4                   shopfronts_138.jpg  3733.0  8498.0      g4
5  shopfronts_25092017_eateries_95.jpg   187.0   250.0      g1
6      shopfronts_25092017_neon_33.jpg   100.0   200.0      g1
7                   shopfronts_322.jpg   682.0  1024.0  others
8                   shopfronts_171.jpg   800.0   600.0      g3
9         shopfronts_23092017_3_35.jpg   120.0   210.0      g1
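
The nested chain above can also be written flat with `np.select`. The sketch below is an alternative, not the answer's own code: it spells out both bounds of each interval so the conditions follow the g1..g4 definitions in the question literally. The function name `label_groups` is hypothetical, and the sample data is copied from the question with filenames omitted.

```python
import numpy as np
import pandas as pd

def label_groups(frame):
    """Label rows g1..g4 when height and width fall in the same interval, else 'others'."""
    h, w = frame["height"], frame["width"]
    conditions = [
        (h <= 400) & (w <= 300),                          # g1
        (h > 400) & (h <= 640) & (w > 300) & (w <= 480),  # g2
        (h > 640) & (h <= 800) & (w > 480) & (w <= 600),  # g3
        (h > 800) & (w > 600),                            # g4
    ]
    # The first matching condition wins; rows matching none get the default.
    return np.select(conditions, ["g1", "g2", "g3", "g4"], default="others")

# Sample heights and widths copied from the question.
df = pd.DataFrame({
    "height": [750.0, 4395.0, 414.0, 480.0, 3733.0, 187.0, 100.0, 682.0, 800.0, 120.0],
    "width": [560.0, 6020.0, 621.0, 640.0, 8498.0, 250.0, 200.0, 1024.0, 600.0, 210.0],
})
df["group"] = label_groups(df)

# Hypothetical row that is not in the question: height in g1 range, width in g2 range.
mixed = pd.DataFrame({"height": [300.0], "width": [400.0]})
print(label_groups(mixed).tolist())  # -> ['others']
```

On the sample data this gives the same labels as the nested chain. The two differ on mixed rows: because the nested conditions only check upper bounds, the hypothetical row with height 300 and width 400 would fall through to the second branch and be labeled g2 there, while the two-sided conditions label it others.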