这只是我庞大的数据框的一小部分:
data = {'index': ['001', '001', '002', '002', '003', '003', '003', '004', '004', '004', '004', '005'],
'vehicle': ['none', 'car', 'car', 'car', 'none', 'car', 'car', 'car', 'bus', 'bus', 'bus', 'motorcycle'],
'cas_class': ['pedestrian', 'driver', 'driver', 'passenger', 'pedestrian', 'driver', 'driver', 'driver', 'driver', 'passenger', 'passenger', 'driver']}
df = pd.DataFrame (data, columns = ['index', 'vehicle', 'cas_class'])
df
输出:
index vehicle cas_class
0 001 none pedestrian
1 001 car driver
2 002 car driver
3 002 car passenger
4 003 none pedestrian
5 003 car driver
6 003 car driver
7 004 car driver
8 004 bus driver
9 004 bus passenger
10 004 bus passenger
11 005 motorcycle driver
通过此代码,我设法做到了:
n_of_veh = df[df["cas_class"]=='driver'].groupby(['index']).size().reset_index()\
.rename(columns= {0: 'n_of_veh'})
df = df.merge(n_of_veh, on='index')
有人可以帮忙吗?我还在学习熊猫。 :/
期望的输出:
index vehicle class n_of_veh reference
0 001 none pedestrian 1 1
1 001 car driver 1 1
2 002 car driver 1 1
3 002 car passenger 1 1
4 003 none pedestrian 2 1
5 003 car driver 2 1
6 003 car driver 2 2
7 004 car driver 2 1
8 004 bus driver 2 2
9 004 bus passenger 2 2
10 004 bus passenger 2 2
11 005 motorcycle driver 1 1
# columns translation
df = df.rename({'FECHA': 'Date', 'RANGO HORARIO': 'Hour', 'DIA SEMANA': 'Day_of_Week', 'DISTRITO': 'District', 'LUGAR ACCIDENTE': 'Street_Address', 'Nº': 'Street_Number', 'Nº PARTE': 'Accident_Index', 'CPFA Granizo': 'WC_Hail', 'CPFA Hielo': 'WC_Ice', 'CPFA Lluvia': 'WC_Rainy', 'CPFA Niebla': 'WC_Foggy', 'CPFA Seco': 'WC_Dry', 'CPFA Nieve': 'WC_Snowy', 'CPSV Mojada': 'RC_Wet', 'CPSV Aceite': 'RC_Oil', 'CPSV Barro': 'RC_Mud', 'CPSV Grava Suelta': 'RC_Gravel_Loose', 'CPSV Hielo': 'RC_Ice', 'CPSV Seca Y Limpia': 'RC_Dry_and_Clean', '* Nº VICTIMAS': 'Number_of_Casualties', 'TIPO ACCIDENTE': 'Collision_Type', 'Tipo Vehiculo': 'Vehicle_Type', 'TIPO PERSONA': 'Casualty_Class', 'SEXO': 'Sex_of_Casualty', 'LESIVIDAD': 'Casualty_Severity', 'Tramo Edad': 'Age_Band_of_Casualty'}, axis=1)
# removing columns that are not needed
df = df.drop(columns= ['Date', 'Hour', 'Day_of_Week', 'District', 'Street_Address', 'Street_Number', 'WC_Hail', 'WC_Ice', 'WC_Rainy', 'WC_Foggy', 'WC_Dry', 'WC_Snowy', 'RC_Wet', 'RC_Oil', 'RC_Mud', 'RC_Gravel_Loose', 'RC_Ice', 'RC_Dry_and_Clean', 'Sex_of_Casualty', 'Age_Band_of_Casualty'])
# other translations
collision_dict = {'COLISION DOBLE': 'Double_Collision', 'ATROPELLO': 'Pedestrian_Hit', 'COLISION MULTIPLE': 'Multiple_Collision', 'CAIDA MOTOCICLETA': 'Motorcycle_Fall', 'CHOQUE CON OBJETO FIJO': 'Accident_with_a_Fixed_Object', 'CAIDA VIAJERO BUS': 'Bus_Passenger_Fall', 'CAIDA BICICLETA': 'Bicycle_Fall', 'CAIDA CICLOMOTOR': 'Moped_Fall', 'OTRAS CAUSAS': 'Other_Types', 'VUELCO': 'Ended_on_the_Roof', 'CAIDA VEHICULO 3 RUEDAS': '3-Wheel_Vehicle_Fall'}
vehtype_dict = {'TURISMO': 'OtherVehicles', 'NO ASIGNADO': 'Not_Assigned', 'MOTOCICLETA': 'Motorcycle', 'FURGONETA': 'GoodsVehicle', 'AUTOBUS-AUTOCAR': 'BusCoach', 'AUTO-TAXI': 'CarTaxi', 'BICICLETA': 'Bicycle', 'CICLOMOTOR': 'Motorcycle', 'CAMION': 'GoodsVehicle', 'VARIOS': 'OtherVehicles', 'AMBULANCIA': 'OtherVehicles', 'VEH.3 RUEDAS': 'OtherVehicles'}
cclass_dict = {'CONDUCTOR': 'Driver', 'VIAJERO': 'Passenger', 'TESTIGO': 'Witness', 'PEATON': 'Pedestrian'}
csev_dict = {'HL' : 'Slight', 'HG': 'Serious', 'MT': 'Fatal', 'NO ASIGNADA': 'Not_Assigned', 'IL': 'Not_Injured'}
df.replace({'Collision_Type': collision_dict}, regex=True, inplace=True)
df.replace({'Vehicle_Type': vehtype_dict}, regex=True, inplace=True)
df.replace({'Casualty_Class': cclass_dict}, regex=True, inplace=True)
df.replace({'Casualty_Severity': csev_dict}, regex=True, inplace=True)
答案 0 :(得分:1)
此解决方案回答了OP在连续编辑中发布的原始数据集。变量名称将由原始数据而不是问题开头的示例数据选择。
1)我们首先创建一个查找数据框,其中具有每个事件的车辆数量,然后将左联接与原始数据框df
df_n_of_veh = df.groupby(['Accident_Index']).apply(lambda x: sum(x['Casualty_Class'] == 'Driver')).to_frame(name='n_of_veh').reset_index()
df = df.merge(df_n_of_veh, how='left', on='Accident_Index')
2)此解决方案假定每个驾驶员下方都有乘客。基本上,每次事件中'Driver'
中有Casualty_Class
时,我们就增加1。此后,由于乘客或证人仍然会有0,因此我们将其替换为1。这次我们使用行索引进行合并,因为它们在df_ref
df_ref = df.groupby('Accident_Index').apply(lambda x: (x['Casualty_Class'] == 'Driver').astype(int).cumsum()).to_frame(name='reference').reset_index(level='Accident_Index').drop(['Accident_Index'], axis=1)
df_ref.loc[df_ref['reference'] == 0, 'reference'] = 1
df = df.merge(df_ref, left_index=True, right_index=True)
输出
(对不起,我相信这是显示它的最短方法)