Get row counts using breakpoints from a different column

Time: 2018-05-21 07:55:11

Tags: pandas numpy breakpoints

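Consider a DataFrame with two columns, A and B. How can I split column A into deciles and then use column A's decile breakpoints to count the rows of column B that fall into each range?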

import pandas as pd
import numpy as np

# Raw string, so the backslashes in the Windows path are not read as escape sequences.
df=pd.read_excel(r"E:\Sai\Development\UCG\qcut.xlsx")

df['Range']=pd.qcut(df['a'],10)

df_gb=df.groupby('Range',as_index=False).agg({'a':[min,max,np.size]})

df_gb.columns = df_gb.columns.droplevel()
df_gb=df_gb.rename(columns={'':'Range','size':'count_A'})

# Assign through df.loc[mask, column]; chained indexing may write to a temporary copy.
df['Range_B']=0
df.loc[df['b']<=df_gb['max'][0],'Range_B']=1
df.loc[(df['b']>df_gb['max'][0]) & (df['b']<=df_gb['max'][1]),'Range_B']=2
df.loc[(df['b']>df_gb['max'][1]) & (df['b']<=df_gb['max'][2]),'Range_B']=3
df.loc[(df['b']>df_gb['max'][2]) & (df['b']<=df_gb['max'][3]),'Range_B']=4
df.loc[(df['b']>df_gb['max'][3]) & (df['b']<=df_gb['max'][4]),'Range_B']=5
df.loc[(df['b']>df_gb['max'][4]) & (df['b']<=df_gb['max'][5]),'Range_B']=6
df.loc[(df['b']>df_gb['max'][5]) & (df['b']<=df_gb['max'][6]),'Range_B']=7
df.loc[(df['b']>df_gb['max'][6]) & (df['b']<=df_gb['max'][7]),'Range_B']=8
df.loc[(df['b']>df_gb['max'][7]) & (df['b']<=df_gb['max'][8]),'Range_B']=9
df.loc[df['b']>df_gb['max'][8],'Range_B']=10

df_gb_b=df.groupby('Range_B',as_index=False).agg({'b':np.size})

df_gb_b=df_gb_b.rename(columns={'b':'count_B'})

df_final = pd.concat([df_gb, df_gb_b], axis=1)

df_final=df_final[['Range','count_A','count_B']]

Is there a simpler solution? I intend to do this for many columns.

2 answers:

Answer 0 (score: 2)

I hope this helps:

df['Range'] = pd.qcut(df['a'], 10)
df2 = df.groupby(['Range'])['a'].count().reset_index().rename(columns = {'a':'count_A'})

for item in df2['Range'].values:
    df2.loc[df2['Range'] == item, 'count_B'] = df['b'].apply(lambda x: x in item).sum()

df2 = df2.sort_values('Range', ascending = True)
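
The loop above tests every value of b against each interval in Python. A possibly faster variant, sketched here under the assumption that the intervals produced by `pd.qcut` do not overlap, looks up the containing interval for all values at once with `IntervalIndex.get_indexer`. The sample data is made up for illustration. It uses the same rounded interval labels as the loop (the `precision` argument of `pd.qcut` defaults to 3), so a value of b lying extremely close to an edge may be classified differently than with the exact quantile edges.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration; the question's spreadsheet is not available.
rng = np.random.default_rng(1)
df = pd.DataFrame({"a": rng.uniform(0, 100, 500),
                   "b": rng.uniform(-20, 120, 500)})

df["Range"] = pd.qcut(df["a"], 10)
intervals = df["Range"].cat.categories  # IntervalIndex holding the ten decile bins
n_bins = len(intervals)

# Position of the interval containing each b value; -1 means no interval matches.
pos = intervals.get_indexer(df["b"].to_numpy())

df2 = pd.DataFrame({
    "Range": list(intervals),
    # Category codes give the bin of each a value as assigned by qcut.
    "count_A": np.bincount(df["Range"].cat.codes.to_numpy(), minlength=n_bins),
    # Count only the b values that fell inside some interval.
    "count_B": np.bincount(pos[pos >= 0], minlength=n_bins),
})
print(df2)
```

Values of b outside every interval get position -1 and stay uncounted, which matches the behaviour of the loop above.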

If you also want to count the values of b that fall outside the range of a:

min_border = df2['Range'].values[0].left
max_border = df2['Range'].values[-1].right

df2.loc[0, 'count_B'] += df.loc[df['b'] <= min_border, 'b'].count()
df2.iloc[-1, 2] += df.loc[df['b'] >  max_border, 'b'].count()
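
For the many-column case the question mentions, one option is to take the exact decile edges via `retbins=True` and bin every column with `np.searchsorted`, folding out-of-range values into the first and last bins as in the adjustment above. The helper name `counts_by_quantile_bins`, its parameters, and the sample data are assumptions made for this sketch and are not part of the original answer.

```python
import numpy as np
import pandas as pd


def counts_by_quantile_bins(df, ref_col, other_cols, q=10):
    """Count rows of each column per quantile bin of ref_col (hypothetical helper)."""
    # retbins=True also returns the q + 1 bin edges computed from ref_col.
    _, edges = pd.qcut(df[ref_col], q, retbins=True)
    n_bins = len(edges) - 1
    out = pd.DataFrame({"left": edges[:-1], "right": edges[1:]})
    for col in [ref_col, *other_cols]:
        values = df[col].dropna().to_numpy()
        # Bin i holds edges[i] < x <= edges[i + 1]; clipping folds values
        # outside the edges into the first or last bin.
        idx = np.searchsorted(edges, values, side="left") - 1
        idx = np.clip(idx, 0, n_bins - 1)
        out[f"count_{col}"] = np.bincount(idx, minlength=n_bins)
    return out


# Made-up data for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=200),
                   "b": rng.normal(size=200) * 2})
result = counts_by_quantile_bins(df, "a", ["b"])
print(result)
```

Because the edges are computed once and reused, adding more columns only means passing more names in `other_cols`.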

Answer 1 (score: 1)

One way:

df = pd.DataFrame({'A': np.random.randint(0, 100, 20), 'B': np.random.randint(0, 10, 20)})
bins = [0, 1, 4, 8, 16, 32, 60, 100, 200, 500, 5999]
labels = ["{0} - {1}".format(i, j) for i, j in zip(bins, bins[1:])]

df['group_A'] = pd.cut(df['A'], bins, right=False, labels=labels)
df['group_B'] = pd.cut(df.B, bins, right=False, labels=labels)

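# observed=False keeps empty bins in the result, as in the output below.
df1 = df.groupby(['group_A'], observed=False)['A'].count().reset_index()
df2 = df.groupby(['group_B'], observed=False)['B'].count().reset_index()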

df_final = pd.merge(df1, df2, left_on =['group_A'], right_on =['group_B']).drop(['group_B'], axis=1).rename(columns={'group_A': 'group'})
print(df_final)

Output

        group  A  B
0       0 - 1  0  1
1       1 - 4  1  3
2       4 - 8  1  9
3      8 - 16  2  7
4     16 - 32  3  0
5     32 - 60  7  0
6    60 - 100  6  0
7   100 - 200  0  0
8   200 - 500  0  0
9  500 - 5999  0  0