我正在研究具有以下列的汽车销售数据集:'car','price','body','mileage','engV','engType','registration','year','model', “驱动器”
列'drive'和'engType'缺少NaN值,我想基于['car','model']的分组依据来计算'drive'的模式,然后该组落在哪里要根据此分组依据替换NaN值
我尝试了以下方法:
用于数字数据
carsale ['engV2'] =(carsale.groupby(['car','body','model']))['engV']。transform(lambda x:x.fillna(x.median() ))
用于分类数据
carsale ['driveT'] =(carsale.groupby(['car','model']))['drive']。transform(lambda x:x.fillna(x.mode())) carsale ['driveT'] =(carsale.groupby(['car','model']))['drive']。transform(lambda x:x.fillna(pd.Series.mode(x)))
两者都给出相同的结果
# carsale['price2'] = (carsale.groupby(['car','model','year']))['price'].transform(lambda x: x.fillna(x.median()))
# carsale['engV2'] = (carsale.groupby(['car','body','model']))['engV'].transform(lambda x: x.fillna(x.median()))
# carsale['mileage2'] = (carsale.groupby(['car','model','year']))['mileage'].transform(lambda x: x.fillna(x.median()))
# mode = carsale.filter(['car','drive']).mode()
# carsale[['test1','test2']] = carsale[['car','engType']].fillna(carsale.mode().iloc[0])
**carsale.groupby(['car', 'model'])['engType'].apply(pd.Series.mode)**
# carsale.apply()
# carsale
# carsale['engType2'] = carsale.groupby('car').engType.transform(lambda x: x.fillna(x.mode()))
**carsale['driveT'] = carsale.groupby(['car', 'model'])['drive'
].transform(lambda x: x.fillna(x.mode()))
carsale['driveT'] = carsale.groupby(['car', 'model'])['drive'
].transform(lambda x: x.fillna(pd.Series.mode(x)))**
# carsale[carsale.car == 'Mercedes-Benz'].sort_values(['body','engType','model','mileage']).tail(50)
# carsale[carsale.engV.isnull()]
# carsale.sort_values(['car','body','model'])
**carsale**
两种方法的给出相同的结果,只是替换/添加新列driveT中的值,与原始列'drive'中的值相同。例如,如果我们在某些索引中包含NaN,那么它在driveT中也显示相同的NaN,而对于其他值也显示相同。
但是对于数值数据,如果我应用中位数,则它会添加/替换正确的值。
所以实际上不是基于['car','model']组的模式,而是针对'drive'中的单个值进行模式,但是如果您运行此命令
**carsale.groupby(['car','model'])['engType'].apply(pd.Series.mode)**
这是基于groupby(汽车,模型)的正确计算模式
有人可以帮忙吗?
答案 0 :(得分:1)
我的方法是:
drive
/ car
组合的model
功能的模式。 drive
中该汽车/模型的值为空时,编写一个方法在该数据帧中查找模式并针对给定的汽车/模型返回该模式。但是,事实证明,有两个特定于OP数据集的关键案例需要处理:
drive
列中的所有条目均为NaN)。下面是我遵循的步骤。如果我从问题的示例数据帧的前几行开始扩展一个示例:
carsale = pd.DataFrame({'car': ['Ford', 'Mercedes-Benz', 'Mercedes-Benz', 'Mercedes-Benz', 'Mercedes-Benz', 'Nissan', 'Honda','Renault', 'Mercedes-Benz', 'Mercedes-Benz', 'Toyota', 'Toyota', 'Ferrari'],
'price': [15500.000, 20500.000, 35000.000, 17800.000, 33000.000, 16600.000, 6500.000, 10500.000, 21500.000, 21500.000, 1280.000, 2005.00, 300000.000],
'body': ['crossover', 'sedan', 'other', 'van', 'vagon', 'crossover', 'sedan', 'vagon', 'sedan', 'sedan', 'compact', 'compact', 'sport'],
'mileage': [68.0, 173.0, 135.0, 162.0, 91.0, 83.0, 199.0, 185.0, 146.0, 146.0, 200.0, 134, 123.0],
'engType': ['Gas', 'Gas', 'Petrol', 'Diesel', np.nan, 'Petrol', 'Petrol', 'Diesel', 'Gas', 'Gas', 'Hybrid', 'Gas', 'Gas'],
'registration':['yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes'],
'year': [2010, 2011, 2008, 2012, 2013, 2013, 2003, 2011, 2012, 2012, 2009, 2003, 1988],
'model': ['Kuga', 'E-Class', 'CL 550', 'B 180', 'E-Class', 'X-Trail', 'Accord', 'Megane', 'E-Class', 'E-Class', 'Prius', 'Corolla', 'Testarossa'],
'drive': ['full', 'rear', 'rear', 'front', np.nan, 'full', 'front', 'front', 'rear', np.nan, np.nan, 'front', np.nan],
})
carsale
car price body mileage engType registration year model drive
0 Ford 15500.0 crossover 68.0 Gas yes 2010 Kuga full
1 Mercedes-Benz 20500.0 sedan 173.0 Gas yes 2011 E-Class rear
2 Mercedes-Benz 35000.0 other 135.0 Petrol yes 2008 CL 550 rear
3 Mercedes-Benz 17800.0 van 162.0 Diesel yes 2012 B 180 front
4 Mercedes-Benz 33000.0 vagon 91.0 NaN yes 2013 E-Class NaN
5 Nissan 16600.0 crossover 83.0 Petrol yes 2013 X-Trail full
6 Honda 6500.0 sedan 199.0 Petrol yes 2003 Accord front
7 Renault 10500.0 vagon 185.0 Diesel yes 2011 Megane front
8 Mercedes-Benz 21500.0 sedan 146.0 Gas yes 2012 E-Class rear
9 Mercedes-Benz 21500.0 sedan 146.0 Gas yes 2012 E-Class NaN
10 Toyota 1280.0 compact 200.0 Hybrid yes 2009 Prius NaN
11 Toyota 2005.0 compact 134.0 Gas yes 2003 Corolla front
12 Ferrari 300000.0 sport 123.0 Gas yes 1988 Testarossa NaN
为每个drive
/ car
组合创建一个数据框,以显示model
功能的模式。
如果汽车/模型组合没有模式(例如Toyota Prius所在的行),则填写该特定汽车品牌(丰田)的模式。
但是,如果汽车品牌本身(例如本例中的Ferrari)没有模式,我将使用drive
功能的数据集模式。
def get_drive_mode(x):
brand = x.name[0]
if x.count() > 0:
return x.mode() # Return mode for a brand/model if the mode exists.
elif carsale.groupby(['car'])['drive'].count()[brand] > 0:
brand_mode = carsale.groupby(['car'])['drive'].apply(lambda x: x.mode())[brand]
return brand_mode # Return mode of brand if particular brand/model combo has no mode,
else: # but brand itself has a mode for the 'drive' feature.
return carsale['drive'].mode() # Otherwise return dataset's mode for the 'drive' feature.
drive_modes = carsale.groupby(['car','model'])['drive'].apply(get_drive_mode).reset_index().drop('level_2', axis=1)
drive_modes.rename(columns={'drive': 'drive_mode'}, inplace=True)
drive_modes
car model drive_mode
0 Ferrari Testarossa front
1 Ford Kuga full
2 Honda Accord front
3 Mercedes-Benz B 180 front
4 Mercedes-Benz CL 550 rear
5 Mercedes-Benz E-Class rear
6 Nissan X-Trail full
7 Renault Megane front
8 Toyota Corolla front
9 Toyota Prius front
drive
模式值,如果该行的drive
值为NaN:def fill_with_mode(x):
if pd.isnull(x['drive']):
return drive_modes[(drive_modes['car'] == x['car']) & (drive_modes['model'] == x['model'])]['drive_mode'].values[0]
else:
return x['drive']
carsale
数据框中的行,以创建driveT
功能:carsale['driveT'] = carsale.apply(fill_with_mode, axis=1)
del(drive_modes)
这将导致以下数据帧:
carsale
car price body mileage engType registration year model drive driveT
0 Ford 15500.0 crossover 68.0 Gas yes 2010 Kuga full full
1 Mercedes-Benz 20500.0 sedan 173.0 Gas yes 2011 E-Class rear rear
2 Mercedes-Benz 35000.0 other 135.0 Petrol yes 2008 CL 550 rear rear
3 Mercedes-Benz 17800.0 van 162.0 Diesel yes 2012 B 180 front front
4 Mercedes-Benz 33000.0 vagon 91.0 NaN yes 2013 E-Class NaN rear
5 Nissan 16600.0 crossover 83.0 Petrol yes 2013 X-Trail full full
6 Honda 6500.0 sedan 199.0 Petrol yes 2003 Accord front front
7 Renault 10500.0 vagon 185.0 Diesel yes 2011 Megane front front
8 Mercedes-Benz 21500.0 sedan 146.0 Gas yes 2012 E-Class rear rear
9 Mercedes-Benz 21500.0 sedan 146.0 Gas yes 2012 E-Class NaN rear
10 Toyota 1280.0 compact 200.0 Hybrid yes 2009 Prius NaN front
11 Toyota 2005.0 compact 134.0 Gas yes 2003 Corolla front front
12 Ferrari 300000.0 sport 123.0 Gas yes 1988 Testarossa NaN front
请注意,在driveT
列的第4行和第9行中,drive
列中的NaN值已被字符串rear
取代,正如我们所期望的那样,是奔驰E级轿车的drive
模式。
在第11行中,由于没有丰田普锐斯汽车/模型组合的模式,因此我们填充了丰田品牌front
的模式。
最后,在第12行中,由于没有法拉利汽车品牌的模式,因此我们填充了整个数据集的drive
列的模式,该列也为front
。