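I created two different datasets from the input in order to get two different measures. Now I need to merge the two on more than one column, and I need to know which arguments of the merge function specify the required columns.
My code: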
import pandas as pn
df_csv = pn.read_csv('E:\\Sources\\BixiMontrealRentals2017\\OD_2017-06.csv',dtype={"user_id": int},low_memory= False,sep=',')
# data readiness for stations as starting
df_csv['start_date_dt']= pn.to_datetime(df_csv['start_date'],infer_datetime_format=True)
df_csv['start_day'] = df_csv['start_date_dt'].dt.day_name()  # .dt.weekday_name was removed in pandas 1.0
df_csv['start_hour'] = df_csv['start_date_dt'].dt.hour
df_start = df_csv.drop(df_csv.columns[[0,2,3,4,5,6]],axis=1)
df_start_summ = df_start.groupby(['start_station_code', 'start_day','start_hour']).size().reset_index(name='start_counts')
print(df_start_summ.head())
# data readiness for stations as ending
df_csv['end_date_dt']= pn.to_datetime(df_csv['end_date'],infer_datetime_format=True)
df_csv['end_day'] = df_csv['end_date_dt'].dt.day_name()  # .dt.weekday_name was removed in pandas 1.0
df_csv['end_hour'] = df_csv['end_date_dt'].dt.hour
df_end = df_csv.drop(df_csv.columns[[0,1,2,4,5,6,7,8,9]],axis=1)
df_end_summ = df_end.groupby(['end_station_code', 'end_day','end_hour']).size().reset_index(name='end_counts')
print(df_end_summ.head())
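Output of the two datasets:
Ideally, the merge should be applied by station, day, and hour. However, the columns have different names in each dataset, and I don't know how to specify the join I need.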
df_rowdata = pn.merge(df_start_summ,df_end_summ,
left_on= 'start_station_code', 'start_day','start_hour'
,how='inner')
I need something similar to this T-SQL:
left join
on start_station_code = end_station_code
and start_day = end_day
and start_hour = end_hour
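Thanks for your help and comments.
Answer 0 (score: 0)
The syntax you're using for the pandas DataFrame merge isn't quite right: left_on has to be a list of column names, and it needs a matching right_on list for the other frame. Also, you're using how='inner', but the SQL join you want to replicate is a left join, so you probably want how='left' instead.
Try something like: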
# Reproduce example dfs
import pandas as pd
df_start_summ = pd.DataFrame({'start_station_code':[5002]*5,
'start_day':['Friday']*5,
'start_hour':[6,8,9,12,14],
'start_counts':[1,1,1,1,2]
})[['start_station_code',
'start_day', 'start_hour',
'start_counts']]
df_end_summ = pd.DataFrame({'end_station_code':[5002]*5,
'end_day':['Friday']*5,
'end_hour':[4,8,12,13,15],
'end_counts':[1,1,1,1,1]
})[['end_station_code',
'end_day', 'end_hour',
'end_counts']]
# inner merge (how='inner' is the default, so it could be omitted)
inner_merge = df_start_summ.merge(df_end_summ,
    left_on=['start_station_code', 'start_day', 'start_hour'],
    right_on=['end_station_code', 'end_day', 'end_hour'], how='inner')
# left merge:
left_merge = df_start_summ.merge(df_end_summ,
    left_on=['start_station_code', 'start_day', 'start_hour'],
    right_on=['end_station_code', 'end_day', 'end_hour'], how='left')
This results in:
>>> inner_merge
   start_station_code start_day  start_hour  start_counts  end_station_code  \
0                5002    Friday           8             1              5002
1                5002    Friday          12             1              5002
  end_day  end_hour  end_counts
0  Friday         8           1
1  Friday        12           1
>>> left_merge
   start_station_code start_day  start_hour  start_counts  end_station_code  \
0                5002    Friday           6             1               NaN
1                5002    Friday           8             1            5002.0
2                5002    Friday           9             1               NaN
3                5002    Friday          12             1            5002.0
4                5002    Friday          14             2               NaN
  end_day  end_hour  end_counts
0     NaN       NaN         NaN
1  Friday       8.0         1.0
2     NaN       NaN         NaN
3  Friday      12.0         1.0
4     NaN       NaN         NaN
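In the left merge, the end_* key columns only repeat the start_* keys where a match exists, and the unmatched rows turn end_counts into a float column holding NaN. Here is a minimal sketch of one way to tidy that up. It assumes that a missing end count should be read as 0, which is an assumption about the data rather than something stated in the question. The names tidy, station_code, day, and hour are illustrative.

```python
import pandas as pd

# Same example frames as above
df_start_summ = pd.DataFrame({
    'start_station_code': [5002] * 5,
    'start_day': ['Friday'] * 5,
    'start_hour': [6, 8, 9, 12, 14],
    'start_counts': [1, 1, 1, 1, 2],
})
df_end_summ = pd.DataFrame({
    'end_station_code': [5002] * 5,
    'end_day': ['Friday'] * 5,
    'end_hour': [4, 8, 12, 13, 15],
    'end_counts': [1, 1, 1, 1, 1],
})

left_merge = df_start_summ.merge(
    df_end_summ,
    left_on=['start_station_code', 'start_day', 'start_hour'],
    right_on=['end_station_code', 'end_day', 'end_hour'],
    how='left',
)

# Drop the redundant right-hand keys and give the left-hand keys neutral names
tidy = (
    left_merge
    .drop(columns=['end_station_code', 'end_day', 'end_hour'])
    .rename(columns={'start_station_code': 'station_code',
                     'start_day': 'day',
                     'start_hour': 'hour'})
)

# Assumption: no matching end row means zero trips ended in that slot
tidy['end_counts'] = tidy['end_counts'].fillna(0).astype(int)
print(tidy)
```

If a missing end count should stay distinguishable from a real zero, skip the fillna step and leave the NaN values in place.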
Also take a look at the pandas documentation on merging.
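An alternative you might consider is renaming the key columns to shared names first, so that a single on= list can be used and the result does not carry two copies of each key. This is only a sketch under the same example data. The names keys, starts, ends, and merged are made up for illustration, and validate='one_to_one' is an optional check that assumes each station/day/hour combination appears at most once per frame, which holds after the groupby in the question.

```python
import pandas as pd

# Same example frames as in the answer above
df_start_summ = pd.DataFrame({
    'start_station_code': [5002] * 5,
    'start_day': ['Friday'] * 5,
    'start_hour': [6, 8, 9, 12, 14],
    'start_counts': [1, 1, 1, 1, 2],
})
df_end_summ = pd.DataFrame({
    'end_station_code': [5002] * 5,
    'end_day': ['Friday'] * 5,
    'end_hour': [4, 8, 12, 13, 15],
    'end_counts': [1, 1, 1, 1, 1],
})

# Hypothetical shared key names used on both sides
keys = ['station_code', 'day', 'hour']

starts = df_start_summ.rename(columns={'start_station_code': 'station_code',
                                       'start_day': 'day',
                                       'start_hour': 'hour'})
ends = df_end_summ.rename(columns={'end_station_code': 'station_code',
                                   'end_day': 'day',
                                   'end_hour': 'hour'})

# Left join on the shared keys; validate raises MergeError if keys are duplicated
merged = starts.merge(ends, on=keys, how='left', validate='one_to_one')
print(merged)
```

With how='left', rows that exist only in the end summary are dropped. Switching to how='outer' would keep them as well, if both measures are wanted for every station, day, and hour.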