Question

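I have a very large dataframe (> 30M rows), and I need to create a number of columns based on conditions and values in other columns. I have done this before with the apply and map methods, but using them across a dataframe this large is very inefficient and slow. I am looking for a faster, more scalable alternative.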

Here is the head of the dataframe:

2019_date  | Carrier  | Service_y | ship_from_location
2019-12-17 | USPS     | PM        | ECFC

And the code I tried:

def cut_off(row):
    if (row['2019_date']>='2019-12-17' and row['Carrier']=='USPS' and row['Service_y']=='FCPS'):
        return 'disable'
    if (row['2019_date']>='2019-12-19' and row['Carrier']=='USPS' and row['Service_y']=='PM'):
        return 'disable'
    if (row['2019_date']>='2019-12-19' and row['ship_from_location']=='ECFC' and row['Carrier']=='UDSL'):
        return 'disable'
    if (row['2019_date']>='2019-12-19' and row['ship_from_location']=='MWFC' and row['Carrier']=='EMSY'):
        return 'disable'
    if (row['2019_date']>='2019-12-19' and row['ship_from_location']=='ECFC' and row['Carrier']=='LASG'):
        return 'disable'
    if (row['2019_date']>='2019-12-19' and row['ship_from_location']=='beauty_659' and row['Carrier']=='LASG'):
        return 'disable'
    if (row['2019_date']>='2019-12-19' and row['ship_from_location']=='RDR_699' and row['Carrier']=='LASG'):
        return 'disable'
    if (row['2019_date']>='2019-12-22' and row['ship_from_location']=='ECFC' and row['Carrier']=='CDDT'):
        return 'disable'
    if (row['2019_date']>='2019-12-22' and row['ship_from_location']=='beauty_659' and row['Carrier']=='CDDT'):
        return 'disable'
    if (row['2019_date']>='2019-12-22' and row['ship_from_location']=='RDR_699' and row['Carrier']=='CDDT'):
        return 'disable'
    if (row['Normalized_Service'] in ('3D', '1D', '2D') and row['ship_from_location']=='beauty_659' and row['Carrier']!='UPSN'):
        return 'disable'
    if (row['Normalized_Service'] in ('3D', '1D', '2D') and row['ship_from_location']=='beauty_489' and row['Carrier']!='UPSN'):
        return 'disable'
    return 'eligible'

dataframe['eligibility'] = dataframe.apply(cut_off, axis=1)

What is an alternative to using apply with a lambda for faster processing on a large dataframe?
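One common direction, sketched here rather than offered as a definitive answer, is to replace the row-wise apply with boolean masks built from whole columns and combine them with `numpy.where`. This avoids calling a Python function once per row. The sketch assumes that `2019_date` holds ISO `YYYY-MM-DD` strings (which is what the string comparisons in `cut_off` imply), that a `Normalized_Service` column exists even though it is not shown in the head above, and that no relevant column contains missing values. The function name `eligibility` and the sample rows are made up for illustration.

```python
import numpy as np
import pandas as pd


def eligibility(df: pd.DataFrame) -> np.ndarray:
    """Vectorized version of cut_off: one boolean mask per rule, OR-ed together."""
    # Dates are compared as ISO 'YYYY-MM-DD' strings, as in the original function.
    date = df['2019_date']
    carrier = df['Carrier']
    service = df['Service_y']
    origin = df['ship_from_location']
    norm = df['Normalized_Service']  # assumed to exist; not shown in the head

    from_dec19 = date >= '2019-12-19'
    from_dec22 = date >= '2019-12-22'
    # The LASG and CDDT rules list the same three locations.
    shared_sites = origin.isin(['ECFC', 'beauty_659', 'RDR_699'])

    disabled = (
        ((date >= '2019-12-17') & (carrier == 'USPS') & (service == 'FCPS'))
        | (from_dec19 & (carrier == 'USPS') & (service == 'PM'))
        | (from_dec19 & (origin == 'ECFC') & (carrier == 'UDSL'))
        | (from_dec19 & (origin == 'MWFC') & (carrier == 'EMSY'))
        | (from_dec19 & shared_sites & (carrier == 'LASG'))
        | (from_dec22 & shared_sites & (carrier == 'CDDT'))
        | (norm.isin(['3D', '1D', '2D'])
           & origin.isin(['beauty_659', 'beauty_489'])
           & (carrier != 'UPSN'))
    )
    return np.where(disabled, 'disable', 'eligible')


# Made-up sample rows, only to show the call; replace with the real dataframe.
sample = pd.DataFrame({
    '2019_date': ['2019-12-17', '2019-12-17', '2019-12-20',
                  '2019-12-20', '2019-12-01', '2019-12-01'],
    'Carrier': ['USPS', 'USPS', 'LASG', 'CDDT', 'FDXG', 'UPSN'],
    'Service_y': ['PM', 'FCPS', 'X', 'X', 'X', 'X'],
    'ship_from_location': ['ECFC', 'ECFC', 'RDR_699',
                           'ECFC', 'beauty_489', 'beauty_489'],
    'Normalized_Service': ['GND', 'GND', 'GND', 'GND', '2D', '2D'],
})
sample['eligibility'] = eligibility(sample)
```

Because every rule returns the same label, a single OR-ed mask is enough here. If different rules needed different labels, `numpy.select` with a list of masks and a list of labels would be the closer fit.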
