How to call a function in pandas

Time: 2019-05-25 17:58:08

Tags: python pandas

I have the following dataframe in pandas

code     job_descr               job_type     
123      sales executive         nan
124      data scientist          nan
145      marketing manager       nan
132      finance                 nan
144      data analyst            nan

I want to classify job_descr into job_type as follows

sales : Sales
marketing : Marketing
finance : Finance
data science : Analytics
analyst : Analytics

This is what I am doing in pandas

def job_type_redifine(column_name):
   if column_name.str.contains('sales'):
       return 'Sales'
   elif column_name.str.contains('marketing'):
       return 'Marketing'
   elif column_name.str.contains('data science|data scientist|analyst|machine learning'):
    return 'Analytics'
   else:
       return 'Others'


final_df['job_type'] = final_df.apply(lambda row: 
                       job_type_redifine(row['job_descr']), axis=1)

Desired dataframe

code     job_descr               job_type     
123      sales executive         Sales
124      data scientist          Analytics
145      marketing manager       Marketing
132      finance                 Finance
144      data analyst            Analytics

1 answer:

Answer 0: (score: 1)

The first solution uses numpy.select with Series.str.contains. Its advantage is that it works with missing values, but it is slower:

import numpy as np

m1 = final_df['job_descr'].str.contains('sales')
m2 = final_df['job_descr'].str.contains('marketing')
m3 = final_df['job_descr'].str.contains('data science|data scientist|analyst|machine learning')

final_df['job_type'] = np.select([m1, m2, m3], ['Sales','Marketing','Analytics'], default='Others')
print (final_df)
   code          job_descr   job_type
0   123    sales executive      Sales
1   124     data scientist  Analytics
2   145  marketing manager  Marketing
3   132            finance     Others
4   144       data analyst  Analytics

The second solution uses Series.apply. To test for matching values, use in. This loops over every value, but it is still faster because the pandas text functions are slow. The drawback is the last, more complicated condition, which chains many in tests with or:

def job_type_redifine(column_name):
    # Plain substring tests on a single string value
    if 'sales' in column_name:
        return 'Sales'
    elif 'marketing' in column_name:
        return 'Marketing'
    elif ('data science' in column_name or 'data scientist' in column_name
          or 'analyst' in column_name or 'machine learning' in column_name):
        return 'Analytics'
    else:
        return 'Others'


final_df['job_type'] =  final_df['job_descr'].apply(job_type_redifine)
print (final_df)
   code          job_descr   job_type
0   123    sales executive      Sales
1   124     data scientist  Analytics
2   145  marketing manager  Marketing
3   132            finance     Others
4   144       data analyst  Analytics
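
Both outputs above label finance as Others, because the function in the question has no finance branch, while the desired dataframe expects Finance. As a rough sketch of one way to close that gap, the numpy.select approach can be driven by a small rule table. The rules dict, the added finance pattern, and the extra row with a missing value are assumptions made for illustration and are not part of the original answer; na=False is passed to str.contains so that missing descriptions produce False rather than NaN in the masks.

```python
import numpy as np
import pandas as pd

# Sample data from the question, plus one missing value (added for illustration).
final_df = pd.DataFrame({
    'code': [123, 124, 145, 132, 144, 150],
    'job_descr': ['sales executive', 'data scientist', 'marketing manager',
                  'finance', 'data analyst', np.nan],
})

# Hypothetical rule table: regex pattern -> label, checked in insertion order.
rules = {
    'sales': 'Sales',
    'marketing': 'Marketing',
    'finance': 'Finance',
    'data science|data scientist|analyst|machine learning': 'Analytics',
}

# na=False turns missing values into False, so every mask is a boolean array.
masks = [final_df['job_descr'].str.contains(pattern, na=False).to_numpy()
         for pattern in rules]

# The first matching mask wins; rows matching nothing get the default label.
final_df['job_type'] = np.select(masks, list(rules.values()), default='Others')
print(final_df)
```

With this rule table the finance row is labelled Finance, and the row with the missing description falls through to Others. Adding a category then only means adding one entry to rules, and the order of the entries decides which label wins when a description matches more than one pattern.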