Cleaning data with pandas: replace null values with a specific string if another column contains a given substring

Asked: 2020-02-25 17:53:58

Tags: python pandas data-cleaning

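I am currently working on a car emissions dataset in which I need to clean and standardize the car model names. The dataset is large, but here are the first 10 rows: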

import pandas as pd

cars_em_df = pd.DataFrame({
    'manufacturer_name_mapped': ['FIAT', 'FIAT', 'FIAT', 'FIAT', 'FIAT', 'BMW AG', 'BMW AG', 'BMW AG', 'BMW AG', 'BMW AG'],
    'commercial_name': ['124 gt multiair auto', '500l wagon pop star t-jet', 'doblo combi 1.4 95', 'panda  0.9t sge 85 natural power', 'punto 1.4  77 lpg', 'x4 xdrive20d se auto', '216d active tourer b37 f45', '220d gran tourer b47 f46', 'x1 xdrive18d sport', '320i xdrive m sport gt auto'],
    'fuel_type_mapped': ['Petrol', 'Petrol', 'Petrol', 'NG-Biomethane', 'LPG', 'Diesel', 'Diesel', 'Diesel', 'Diesel', 'Petrol'],
    'file_year': [2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018],
    'emissions': [153, 158, 165, 86, 114, 131, 166, 200, 151, 149],
    'commercial_name_cleaned': ['124', '500', None, 'panda', 'punto', 'x4', None, None, 'x1', None],
})

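The rightmost column, 'commercial_name_cleaned', is the result of my first cleaning pass, in which I matched the names in the 'commercial_name' column against a standard list of names from a different source. As you can see, these are very simple, short names. Whenever a model name could not be matched, my function returned None.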

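As a second step, I now want to do the following: if the value is None, search the adjacent 'commercial_name' column for a specific string and replace the None with a model name that I specify. I tried: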

    def str_ops(commercial_name_cleaned,commercial_name):
        if commercial_name_cleaned == None:
            if '216' in commercial_name:
                return '2-series'
            elif '220' in commercial_name:
                return '2-series'
            elif '320' in commercial_name:
                return '3-series'

Then I apply this function to the DataFrame:

cars_em_df['commercial_name_cleaned'] = cars_em_df.apply(lambda x: str_ops(str(x.commercial_name_cleaned), str(x.commercial_name)), axis=1)

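It is important to note that if '320', '220', etc. is not found in 'commercial_name', the function should not change anything and should simply return the value already present in 'commercial_name_cleaned'. However, when I apply the function, the entire 'commercial_name_cleaned' column becomes None. So there must be something wrong with the function. Does anyone know how to fix this?

Any help is much appreciated, thanks!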

1 Answer:

Answer 0: (score: 0)

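You are getting None values in the commercial_name_cleaned column because str_ops never reaches a return statement: when a function does not explicitly return anything, Python implicitly returns None. The apply call passes str(x.commercial_name_cleaned), so a missing value arrives as the string 'None'; the comparison with None is therefore never true, and there is no else branch to return the existing value.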
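A minimal sketch of the two effects at play. The helper `no_return` is a hypothetical name used only for illustration and is not part of the question's code.

```python
def no_return(flag):
    # Only one branch returns a value; every other path falls off the end.
    if flag:
        return "matched"


implicit = no_return(False)   # no return statement is reached
as_text = str(None)           # what str(x.commercial_name_cleaned) yields for a missing value

print(implicit is None)       # True
print(as_text == "None")      # True
print(as_text == None)        # False, so a `== None` check on the string never fires
```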

Replace:

def str_ops(commercial_name_cleaned,commercial_name):
    if commercial_name_cleaned == None:
        if '216' in commercial_name:
            return '2-series'
        elif '220' in commercial_name:
            return '2-series'
        elif '320' in commercial_name:
            return '3-series'

With:

def str_ops(commercial_name_cleaned,commercial_name):
    # The caller passes str(...), so a missing value arrives as the string 'None'.
    if commercial_name_cleaned == 'None':
        if '216' in commercial_name:
            return '2-series'
        elif '220' in commercial_name:
            return '2-series'
        elif '320' in commercial_name:
            return '3-series'
    else:
        # Keep the value that was already cleaned.
        return commercial_name_cleaned
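As an alternative to the row-wise apply, here is a rough vectorized sketch. It is not part of the answer above: the reduced frame `df` and the lookup table `replacements` are illustrative names, and the sketch assumes the missing values are real nulls (not the string 'None') and that 'commercial_name' itself has no missing entries.

```python
import pandas as pd

# Reduced copy of the sample data from the question (two columns only).
df = pd.DataFrame({
    "commercial_name": [
        "124 gt multiair auto", "doblo combi 1.4 95",
        "216d active tourer b37 f45", "220d gran tourer b47 f46",
        "x1 xdrive18d sport", "320i xdrive m sport gt auto",
    ],
    "commercial_name_cleaned": ["124", None, None, None, "x1", None],
})

# Hypothetical lookup table: substring to search for -> model name to assign.
replacements = {"216": "2-series", "220": "2-series", "320": "3-series"}

for needle, model in replacements.items():
    # Rows that are still missing AND whose raw name contains the substring.
    hit = (
        df["commercial_name_cleaned"].isna()
        & df["commercial_name"].str.contains(needle, regex=False)
    )
    df.loc[hit, "commercial_name_cleaned"] = model

print(df)
```

Because the `isna()` mask is recomputed on every iteration, a row filled by an earlier substring is not overwritten by a later one, which mirrors the if/elif order of str_ops. Rows with no matching substring keep their missing value.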