I have a DataFrame df of shape (7694079, 4). Its columns are ['CompanyName', 'MetricValue', 'AsOfDate', 'FiscalYear']. For each FiscalYear there are multiple AsOfDate values.
Example:
    CompanyName       MetricValue  AsOfDate    FiscalYear
49  360Networks Inc.     -295.945  2001-03-31  2000
50  360Networks Inc.      101.992  2001-04-30  2000
51  360Networks Inc.      101.992  2001-05-31  2000
52  360Networks Inc.      101.992  2001-06-30  2000
53  360Networks Inc.      101.992  2001-07-31  2000
54  360Networks Inc.      101.992  2001-08-31  2000
55  360Networks Inc.      101.992  2001-09-30  2000
56  360Networks Inc.      101.992  2001-10-31  2000
57  360Networks Inc.      101.992  2001-11-30  2000
58  360Networks Inc.      101.992  2001-12-31  2000
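My goal is to add a bool column to df, named cleanse_filter, that flags the rows whose AsOfDate is one of the first 6 within each company's FiscalYear.
The code below works, but it takes about 16 seconds per company, and with more than 22k companies it would take forever to run. Any ideas on how to make it more efficient?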
for company in df['CompanyName'].unique():
    for year in df[df['CompanyName'] == company]['FiscalYear'].unique():
        condition = (df['CompanyName'] == company)&(df['FiscalYear'] == year)
        date_thr = pd.to_datetime(df.loc[condition]['AsOfDate']).sort_values().reset_index(drop=True)[5]
        df.loc[condition, 'cleanse_filter'] = df.loc[condition, 'AsOfDate'].apply(lambda x: True if x < date_thr else False)
Answer 0 (score: 1)
I tried rewriting your solution:
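db.items.mapReduce(
    function () {
        // map: emit one (name, price) pair per document
        emit(this.name, this.price);
    },
    function (key, values) {
        // reduce: must return the combined value for the key
        return Array.sum(values);
    },
    { out: "map_reduce_example" }
)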
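For the pandas problem in the question, the nested loops can likely be avoided by ranking the dates within each (CompanyName, FiscalYear) group in a single pass. The following is only a sketch and not the answerer's code. The function name add_cleanse_filter and the n_first parameter are made up for illustration, and it assumes that "first 6" means the six earliest distinct AsOfDate values in each group. Groups with fewer than six dates are flagged entirely.

```python
import pandas as pd


def add_cleanse_filter(df: pd.DataFrame, n_first: int = 6) -> pd.DataFrame:
    """Flag rows whose AsOfDate is among the n_first earliest distinct
    dates of their (CompanyName, FiscalYear) group.

    Hypothetical helper, written for illustration only.
    """
    out = df.copy()
    # Parse once up front instead of once per group.
    dates = pd.to_datetime(out["AsOfDate"])
    # Dense rank: the earliest date in a group gets 1, the next distinct
    # date gets 2, and so on; duplicate dates share the same rank.
    date_rank = dates.groupby([out["CompanyName"], out["FiscalYear"]]).rank(
        method="dense"
    )
    out["cleanse_filter"] = date_rank <= n_first
    return out
```

Calling df = add_cleanse_filter(df) adds the column without iterating over companies in Python. Note that the loop in the question compares with a strict < against the sixth date, which flags only the dates before it; if that behaviour is what is wanted, n_first=5 should reproduce it for groups with distinct dates.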