I have a DataFrame, and I want to append ghost rows (copies of existing rows) to it.
id month as_of_date1 turn age
119 5712 201401 2014-01-01 9 0
120 5712 201402 2014-02-01 9 1
121 5712 201403 2014-03-01 9 2
122 5712 201404 2014-04-01 9 3
123 5712 201405 2014-05-01 9 4
124 5712 201406 2014-06-01 9 5
125 9130 201401 2014-01-01 9 0
126 9130 201402 2014-02-01 9 1
127 9130 201403 2014-03-01 9 2
128 9130 201404 2014-04-01 9 3
129 9130 201405 2014-05-01 9 4
Ghost rows are selected by a condition:
if age is less than turn, copies of the last row need to be appended until age == turn
or as_of_date1 == now()
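Read literally, the rule above gives a copy count for each id's last row: the smaller of `turn - age` and the number of months between `as_of_date1` and now. The following is only a sketch of that reading. The helper name `ghost_row_count` is hypothetical, "now" is taken to be 2018-06-01 (the date hard-coded in the code below), and dates are assumed to fall on the first of a month.

```python
from datetime import date


def ghost_row_count(as_of_date, age, turn, now):
    """Number of copies to append for one row (hypothetical helper)."""
    if age >= turn:
        return 0
    # Calendar months between the two dates, counting full years as 12 months each.
    months_to_now = (now.year - as_of_date.year) * 12 + (now.month - as_of_date.month)
    return max(0, min(turn - age, months_to_now))


# Last rows of the two ids in the sample data.
print(ghost_row_count(date(2014, 6, 1), age=5, turn=9, now=date(2018, 6, 1)))  # 4
print(ghost_row_count(date(2014, 5, 1), age=4, turn=9, now=date(2018, 6, 1)))  # 5
```

These counts match the 4 and 5 extra rows in the expected result.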
Right now I am using the following code, but because the data is large (about 200k rows with 100 fields), it takes forever:
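import datetime
import pandas as pd
from dateutil import relativedelta

# last row per id among the rows where age has not yet reached turn
tdf1 = tdf.loc[tdf['age'] < tdf['turn']]
tdf2 = tdf1.drop_duplicates(subset=['id'], keep='last')
leads = tdf2.index.tolist()
for lead in leads:
    ttdf = tdf.loc[[lead]]
    # total whole months up to 2018-06-01; .months alone drops the years part
    delta = relativedelta.relativedelta(datetime.datetime(2018, 6, 1), tdf.loc[lead, 'as_of_date1'])
    diff1 = delta.years * 12 + delta.months
    diff2 = tdf.loc[lead, 'turn'] - tdf.loc[lead, 'age']
    diff = min(diff1, diff2)
    for i in range(0, diff):
        # every iteration copies the whole frame: this is the slow part
        tdf = pd.concat([tdf, ttdf], ignore_index=True)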
Expected result:
id month as_of_date1 turn age
119 5712 201401 2014-01-01 9 0
120 5712 201402 2014-02-01 9 1
121 5712 201403 2014-03-01 9 2
122 5712 201404 2014-04-01 9 3
123 5712 201405 2014-05-01 9 4
124 5712 201406 2014-06-01 9 5
125 9130 201401 2014-01-01 9 0
126 9130 201402 2014-02-01 9 1
127 9130 201403 2014-03-01 9 2
128 9130 201404 2014-04-01 9 3
129 9130 201405 2014-05-01 9 4
130 5712 201406 2014-06-01 9 5
131 5712 201406 2014-06-01 9 5
132 5712 201406 2014-06-01 9 5
133 5712 201406 2014-06-01 9 5
134 9130 201405 2014-05-01 9 4
135 9130 201405 2014-05-01 9 4
136 9130 201405 2014-05-01 9 4
137 9130 201405 2014-05-01 9 4
138 9130 201405 2014-05-01 9 4
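If anyone knows a faster algorithm, I would be grateful.
Answer 0 (score: 0)
As @Parfit mentioned in the comments, appending to a DataFrame is memory-intensive, and doing it inside a loop is not recommended at all. So I collected the copies in a list and concatenated them once at the end, which greatly improved the speed: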
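a = []  # collect the copies first, concatenate once at the end
for lead in leads:
    ttdf = tdf.loc[[lead]]
    # total whole months up to 2018-06-01; .months alone drops the years part
    delta = relativedelta.relativedelta(datetime.datetime(2018, 6, 1), tdf.loc[lead, 'as_of_date1'])
    diff1 = delta.years * 12 + delta.months
    diff2 = tdf.loc[lead, 'turn'] - tdf.loc[lead, 'age']
    diff = min(diff1, diff2)
    for i in range(0, diff):
        a.append(ttdf)
# single concatenation (DataFrame.append was removed in pandas 2.0)
tdf = pd.concat([tdf] + a, ignore_index=True)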
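The Python loop can probably be removed altogether by computing the copy counts as a column operation and repeating rows by position. The sketch below is not the method used above, just one possible vectorized variant. The function name `add_ghost_rows` and its `now` parameter are hypothetical, and it assumes `as_of_date1` has a datetime dtype with dates on the first of a month, so that calendar-month arithmetic agrees with `relativedelta`.

```python
import numpy as np
import pandas as pd


def add_ghost_rows(tdf, now):
    """Append copies of each id's last row (hypothetical vectorized sketch)."""
    # Last row per id among the rows where age has not yet reached turn.
    last = tdf[tdf["age"] < tdf["turn"]].drop_duplicates(subset=["id"], keep="last")
    # Calendar months between as_of_date1 and `now`, years included.
    dates = last["as_of_date1"]
    months = (now.year - dates.dt.year) * 12 + (now.month - dates.dt.month)
    # Copies per row: limited by both the month gap and turn - age, never negative.
    counts = pd.concat([months, last["turn"] - last["age"]], axis=1).min(axis=1)
    counts = counts.clip(lower=0).to_numpy()
    # Repeat row positions instead of appending one row at a time.
    positions = np.repeat(np.arange(len(last)), counts)
    ghosts = last.iloc[positions]
    return pd.concat([tdf, ghosts], ignore_index=True)
```

Calling `add_ghost_rows(tdf, pd.Timestamp(2018, 6, 1))` on the sample data should append 4 copies of the last 5712 row and 5 copies of the last 9130 row, with a single concatenation regardless of the number of ids.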