我有一个pandas
。DataFrame
,其中包含以下数据:
country branch Name salary mobile no emailid
x a aa 250000 Null Null
x b bb 350000 8976646410 xx@xx.com
y c cc 450000 8777945411 yy@yy.com
y d dd 589630 Null Null
根据某些条件,我会过滤DataFrame
(伪代码):
if salary <= 250000: Normal Employee elif salary >= 250000 and salary <= 600000: Experienced Employee
在执行此操作时,我添加了一个新列,如下所示:
normal = data_df['salary'] <= 250000
experienced = (data_df['salary'] > 250000) & \
(data_df['customer_total_sales'] <= 600000)
data_df['position'] = np.where(normal, 'normal',
np.where(experienced, 'experienced','unknown'))
但是,我想显示DataFrame
,如下所示,删除值为Null
的行:
country branch count_employee count_mobile_no count_email_id count_normal _employee count_experienced_employee
x a 1 0 0 1 0
y c 1 1 1 0 1
要计算字段,我使用以下代码:
a = {'employee': ['count'],
'mobile_number': ['count'],
'customer_emailid': ['count']}
答案 0 :(得分:2)
您可replace
Null
至NaN
,然后groupby
与agg
和reset_index
:
print data_df
country branch Name salary mobile no emailid position
0 x a aa 250000 Null Null unknown
1 x b bb 350000 8976646410 xx@xx.com unknown
2 y c cc 450000 8777945411 yy@yy.com unknown
3 y d dd 589630 Null Null unknown
data_df = data_df.replace('Null', np.nan)
print data_df
country branch Name salary mobile no emailid position
0 x a aa 250000 NaN NaN unknown
1 x b bb 350000 8976646410 xx@xx.com unknown
2 y c cc 450000 8777945411 yy@yy.com unknown
3 y d dd 589630 NaN NaN unknown
df = data_df.groupby(['country', 'branch']).agg({'Name': 'count',
'mobile no':'count',
'emailid': 'count',
'position': 'count'})
print df.reset_index()
country branch emailid position Name mobile no
0 x a 0 1 1 0
1 x b 1 1 1 1
2 y c 1 1 1 1
3 y d 0 1 1 0
编辑:
如果您需要按category
计算排名,请为每个类别创建columns
,然后groupby
count
,drop
列salary
,最后reset_index
:
print data_df
country branch Name salary mobile no emailid
0 x a aa 250000 Null Null
1 x a aa 20000 Null Null
2 x b bb 350000 8976646410 xx@xx.com
3 y c cc 45000 8777945411 yy@yy.com
4 y d dd 589630 Null Null
normal = data_df['salary'] <= 20000
experienced = (data_df['salary'] > 20000) & (data_df['salary'] <= 50000)
unknown = data_df['salary'] > 50000
data_df.loc[normal, 'position_normal'] = 'normal employee'
data_df.loc[experienced,'position_experienced'] = 'experienced employee'
data_df.loc[unknown,'position_unknown'] = 'unknown employee'
print data_df
country branch Name salary mobile no emailid position_normal \
0 x a aa 250000 Null Null NaN
1 x a aa 20000 Null Null normal employee
2 x b bb 350000 8976646410 xx@xx.com NaN
3 y c cc 45000 8777945411 yy@yy.com NaN
4 y d dd 589630 Null Null NaN
position_experienced position_unknown
0 NaN unknown employee
1 NaN NaN
2 NaN unknown employee
3 experienced employee NaN
4 NaN unknown employee
#replace Null to NaN
data_df = data_df.replace('Null', np.nan)
df = data_df.groupby(['country', 'branch']).count()
#remove column salary
df = df.drop('salary', axis=1)
df = df.reset_index()
print df
country branch Name mobile no emailid position_normal \
0 x a 2 0 0 1
1 x b 1 1 1 0
2 y c 1 1 1 0
3 y d 1 0 0 0
position_experienced position_unknown
0 0 1
1 0 1
2 1 0
3 0 1