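I have a very large dataframe (> 30M rows), and I need to create a bunch of columns based on conditions and values in other columns. I have done this before with the apply and map methods, but using them row by row on a dataframe this large is very inefficient and slow. I am looking for a faster, more scalable alternative.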
Here is the head of the dataframe:
2019_date | Carrier | Service_y | ship_from_location
2019-12-17 | USPS | PM | ECFC
And here is the code I tried:
def cut_off(row):
    # USPS cut-offs depend on the service level.
    if row['2019_date'] >= '2019-12-17' and row['Carrier'] == 'USPS' and row['Service_y'] == 'FCPS':
        return 'disable'
    if row['2019_date'] >= '2019-12-19' and row['Carrier'] == 'USPS' and row['Service_y'] == 'PM':
        return 'disable'
    # Cut-offs from 2019-12-19 that depend on the origin and the carrier.
    if row['2019_date'] >= '2019-12-19' and row['ship_from_location'] == 'ECFC' and row['Carrier'] == 'UDSL':
        return 'disable'
    if row['2019_date'] >= '2019-12-19' and row['ship_from_location'] == 'MWFC' and row['Carrier'] == 'EMSY':
        return 'disable'
    if row['2019_date'] >= '2019-12-19' and row['ship_from_location'] == 'ECFC' and row['Carrier'] == 'LASG':
        return 'disable'
    if row['2019_date'] >= '2019-12-19' and row['ship_from_location'] == 'beauty_659' and row['Carrier'] == 'LASG':
        return 'disable'
    if row['2019_date'] >= '2019-12-19' and row['ship_from_location'] == 'RDR_699' and row['Carrier'] == 'LASG':
        return 'disable'
    # Cut-offs from 2019-12-22 for CDDT.
    if row['2019_date'] >= '2019-12-22' and row['ship_from_location'] == 'ECFC' and row['Carrier'] == 'CDDT':
        return 'disable'
    if row['2019_date'] >= '2019-12-22' and row['ship_from_location'] == 'beauty_659' and row['Carrier'] == 'CDDT':
        return 'disable'
    if row['2019_date'] >= '2019-12-22' and row['ship_from_location'] == 'RDR_699' and row['Carrier'] == 'CDDT':
        return 'disable'
    # Express services from the beauty locations are only allowed with UPSN (no date condition).
    if row['Normalized_Service'] in ['3D', '1D', '2D'] and row['ship_from_location'] == 'beauty_659' and row['Carrier'] != 'UPSN':
        return 'disable'
    if row['Normalized_Service'] in ['3D', '1D', '2D'] and row['ship_from_location'] == 'beauty_489' and row['Carrier'] != 'UPSN':
        return 'disable'
    else:
        return 'eligible'

# Row-wise apply: cut_off is called once per row, which is what makes this slow.
dataframe['eligibility'] = dataframe.apply(cut_off, axis=1)
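
For reference, the same rules can probably be written as column-wise boolean masks, so that pandas evaluates each condition over the whole column at once rather than calling a Python function per row. The following is only a sketch under some assumptions: the sample rows are made up for illustration, 2019_date is assumed to hold ISO-formatted strings (which compare correctly as text), and Normalized_Service is assumed to exist as a column even though it is not shown in the head above. Rules that differ only in ship_from_location are merged with isin.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: column names follow the question, the values are invented.
dataframe = pd.DataFrame({
    "2019_date": ["2019-12-17", "2019-12-19", "2019-12-18", "2019-12-22", "2019-12-01"],
    "Carrier": ["USPS", "USPS", "UDSL", "CDDT", "FDXG"],
    "Service_y": ["FCPS", "PM", "GND", "GND", "GND"],
    "ship_from_location": ["ECFC", "ECFC", "ECFC", "RDR_699", "beauty_659"],
    "Normalized_Service": ["GND", "GND", "GND", "GND", "2D"],
})

# Short aliases for the columns used in the rules.
date = dataframe["2019_date"]  # assumed ISO-formatted strings
carrier = dataframe["Carrier"]
service = dataframe["Service_y"]
origin = dataframe["ship_from_location"]
norm = dataframe["Normalized_Service"]

# One boolean Series per rule group, combined with | (element-wise OR).
disable = (
    ((date >= "2019-12-17") & (carrier == "USPS") & (service == "FCPS"))
    | ((date >= "2019-12-19") & (carrier == "USPS") & (service == "PM"))
    | ((date >= "2019-12-19") & (origin == "ECFC") & (carrier == "UDSL"))
    | ((date >= "2019-12-19") & (origin == "MWFC") & (carrier == "EMSY"))
    | ((date >= "2019-12-19") & origin.isin(["ECFC", "beauty_659", "RDR_699"]) & (carrier == "LASG"))
    | ((date >= "2019-12-22") & origin.isin(["ECFC", "beauty_659", "RDR_699"]) & (carrier == "CDDT"))
    | (norm.isin(["3D", "1D", "2D"]) & origin.isin(["beauty_659", "beauty_489"]) & (carrier != "UPSN"))
)

# Every rule returns 'disable', so a single np.where is enough here.
dataframe["eligibility"] = np.where(disable, "disable", "eligible")
```

Because every rule maps to the same label, the order of the conditions does not matter and a single mask is sufficient. If different rules had to produce different labels, np.select with a list of conditions and a list of choices would be the closer equivalent of the if chain.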