I have a dataframe with roughly 2 million rows and 8 columns. It looks like this:
df1:
id, name, visitdate, locationvisited, someotherdate, somebit
8, 16, 2017-09-12 11:09:00.000, 18, 2018-12-28 14:27:28.503, 0
9, 18, 2019-06-11 09:56:14.663, 18, 2019-06-11 09:56:14.663, 1
...
I also have another dataframe with user-defined date ranges. It looks like this:
df2:
year, range
2017, [2017-09-01, 2018-08-31 23:59:59]
2018, [2018-09-01, 2019-08-31 23:59:59]
(the bracketed dates are Pandas date ranges)
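To make the setup concrete, here is a minimal sketch of how the two frames might be built. It assumes each `range` cell is a `pd.Interval` closed on both ends; that is my reading of "Pandas date ranges" and may not match the real data exactly. Only the two sample rows shown above are included.

```python
import pandas as pd

# The two sample rows of df1 shown above, with the date columns parsed to datetimes.
df1 = pd.DataFrame({
    "id": [8, 9],
    "name": [16, 18],
    "visitdate": pd.to_datetime(["2017-09-12 11:09:00.000", "2019-06-11 09:56:14.663"]),
    "locationvisited": [18, 18],
    "someotherdate": pd.to_datetime(["2018-12-28 14:27:28.503", "2019-06-11 09:56:14.663"]),
    "somebit": [0, 1],
})

# Assumption: each range is a pd.Interval that includes both endpoints.
df2 = pd.DataFrame({
    "year": [2017, 2018],
    "range": [
        pd.Interval(pd.Timestamp("2017-09-01"), pd.Timestamp("2018-08-31 23:59:59"), closed="both"),
        pd.Interval(pd.Timestamp("2018-09-01"), pd.Timestamp("2019-08-31 23:59:59"), closed="both"),
    ],
})

# A single timestamp can be tested against a single interval with `in`.
first_visit_in_2017 = df1["visitdate"].iloc[0] in df2["range"].iloc[0]
```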
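I need to determine which "year" in df2 each visitdate in df1 falls into, and write it to the df1 dataframe. I have tried several ways to do this, but they take anywhere from 15 minutes to an hour to run (sadly, I can do this faster from the database, but that is not an acceptable solution). Here are a few of the approaches I tried, with mixed results: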
-ATTEMPT 1: Used a FOR loop over an enumerated list instead of the second dataframe
for index, stuff in enumerate(timeframes_yrs_dtf):
    if index < len(timeframes_yrs_dtf) - 1:
        # every entry except the last: the next entry marks the end of the range
        endpos = index + 1
        tstart = pd.to_datetime(timeframes_yrs_dtf[index], format='%m/%Y')
        tend = pd.to_datetime(timeframes_yrs_dtf[endpos], format='%m/%Y')
    elif index >= len(timeframes_yrs_dtf) - 1:
        # last entry: only a start date is available
        tstart = pd.to_datetime(timeframes_yrs_dtf[index], format='%m/%Y')
    if tstart <= mytestpoint < tend:
        # check to see if visitdate falls within startdate and enddate and write it to a new column in the dataframe here
        pass
-ATTEMPT 2: Used ITERROWS
for index, row in df1.iterrows():
    testpt = pd.to_datetime(row['visitdate'], format='%m/%Y')
    # inner counter is named i so it does not overwrite the df1 row index
    for i, stuff in enumerate(timeframes_yrs_dtf):
        if i < len(timeframes_yrs_dtf) - 1:
            endpos = i + 1
            tstart = pd.to_datetime(timeframes_yrs_dtf[i], format='%m/%Y')
            tend = pd.to_datetime(timeframes_yrs_dtf[endpos], format='%m/%Y')
        elif i >= len(timeframes_yrs_dtf) - 1:
            tstart = pd.to_datetime(timeframes_yrs_dtf[i], format='%m/%Y')
        if tstart <= testpt < tend:
            # check to see if visitdate falls within startdate and enddate and write it to a new column in the dataframe here
            pass
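-ATTEMPT 3: Converted to a function, passing in each row's date
def dateadjust(toadj):
    # toadj is a single visitdate; return the year of the first range that contains it
    for index, row in df_yrs.iterrows():
        if row['Range'].left <= toadj <= row['Range'].right:
            return row['Year']
    return None  # no range matched
df1['grpyear'] = df1['visitdate'].apply(dateadjust)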
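I have also tried various ways of checking whether my date or test-point variable falls within a Pandas "range", but Pandas does not seem to support isin, or a comparison of a single date against the two endpoints of an interval. Things I have not tried yet include filtering the dataframe separately for each range (which seems long-winded) and using CUT with custom bins (because I could not get the bins labeled correctly by year). I would like to keep this within the Pandas library. Is there a faster or more efficient way to do this in Pandas?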
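On the CUT idea specifically, here is a minimal sketch of how year labels might be attached to custom bins. It assumes the ranges are contiguous (each one starts where the previous one ends), so that N+1 edges define N bins. The names `visits`, `edges`, and `labels` are illustrative and not from my real code.

```python
import pandas as pd

# Illustrative stand-in for df1, holding only the column that matters here.
visits = pd.DataFrame({
    "visitdate": pd.to_datetime(["2017-09-12 11:09:00.000", "2019-06-11 09:56:14.663"]),
})

# Assumption: contiguous ranges, so three edges describe the two bins
# [2017-09-01, 2018-09-01) and [2018-09-01, 2019-09-01).
edges = pd.to_datetime(["2017-09-01", "2018-09-01", "2019-09-01"])
labels = [2017, 2018]  # one label per bin, i.e. len(edges) - 1 labels

# right=False makes each bin include its start and exclude its end.
# Dates outside every bin come back as NaN.
visits["grpyear"] = pd.cut(visits["visitdate"], bins=edges, labels=labels, right=False)
```

The result is a categorical column. The whole column is binned in one call rather than row by row.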
Expected Output:
df1:
id, name, visitdate, locationvisited, someotherdate, somebit, grpyear
8, 16, 2017-09-12 11:09:00.000, 18, 2018-12-28 14:27:28.503, 0, 2017
9, 18, 2019-06-11 09:56:14.663, 18, 2019-06-11 09:56:14.663, 1, 2018
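If the ranges are kept as intervals, a lookup through `pd.IntervalIndex.get_indexer` might produce the output above without a Python-level loop. This is a minimal sketch, assuming the intervals do not overlap and are closed on both ends. The frames are cut down to the relevant columns, and the way `df2` is built here is my assumption about its layout.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": [8, 9],
    "visitdate": pd.to_datetime(["2017-09-12 11:09:00.000", "2019-06-11 09:56:14.663"]),
})

# Assumption: the "range" column holds non-overlapping intervals, closed on both ends.
df2 = pd.DataFrame({
    "year": [2017, 2018],
    "range": pd.arrays.IntervalArray.from_arrays(
        pd.to_datetime(["2017-09-01", "2018-09-01"]),
        pd.to_datetime(["2018-08-31 23:59:59", "2019-08-31 23:59:59"]),
        closed="both",
    ),
})

intervals = pd.IntervalIndex(df2["range"])

# For each visitdate, the position of the interval containing it, or -1 if none does.
pos = intervals.get_indexer(df1["visitdate"])

# Map positions to years; rows with no matching interval become NaN.
years = df2["year"].to_numpy()
df1["grpyear"] = pd.Series(years[pos], index=df1.index).where(pos >= 0)
```

`get_indexer` raises an error if the intervals overlap, so this only fits when each date can belong to at most one range.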