A more efficient way to do row operations with multiple boolean indexes

Time: 2018-11-28 08:55:06

Tags: python-3.x pandas

I have a DataFrame df of shape (7694079, 4). The columns are ['CompanyName', 'MetricValue', 'AsOfDate', 'FiscalYear']. For each FiscalYear there are multiple AsOfDate values.

Example:

         CompanyName  MetricValue   AsOfDate  FiscalYear
49  360Networks Inc.     -295.945 2001-03-31        2000
50  360Networks Inc.      101.992 2001-04-30        2000
51  360Networks Inc.      101.992 2001-05-31        2000
52  360Networks Inc.      101.992 2001-06-30        2000
53  360Networks Inc.      101.992 2001-07-31        2000
54  360Networks Inc.      101.992 2001-08-31        2000
55  360Networks Inc.      101.992 2001-09-30        2000
56  360Networks Inc.      101.992 2001-10-31        2000
57  360Networks Inc.      101.992 2001-11-30        2000
58  360Networks Inc.      101.992 2001-12-31        2000

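My goal is to add a bool column to df, named cleanse_filter, that flags the rows whose AsOfDate is one of the first 6 within each FiscalYear for each company.

This code works, but it takes 16 seconds per company, and with more than 22k companies it would take forever to run. Any ideas on how to make it more efficient?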

for company in df['CompanyName'].unique():
    for year in df[df['CompanyName'] == company]['FiscalYear'].unique():
        condition = (df['CompanyName'] == company)&(df['FiscalYear'] == year)    
        date_thr = pd.to_datetime(df.loc[condition]['AsOfDate']).sort_values().reset_index(drop=True)[5]
        df.loc[condition, 'cleanse_filter'] = df.loc[condition, 'AsOfDate'].apply(lambda x: True if x < date_thr else False)

1 Answer:

Answer 0: (score: 1)

I tried rewriting your solution:

db.items.mapReduce(
    function () {
        emit(this.name, this.price);
    },
    function (key, values) {
        // The reduce function must return the reduced value.
        return Array.sum(values);
    },
    { out: "map_reduce_example" }
)
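
For the pandas question itself, one way to avoid the nested loops is to sort once by AsOfDate and number the rows within each (CompanyName, FiscalYear) group. The following is only a minimal sketch, not the answerer's original code. It uses a small made-up frame with the question's column names, and it assumes the index of df is unique so that the result aligns back to the original rows. Note that it flags the six earliest dates as the question describes, whereas the loop in the question compares with < against the sixth date and so flags only five. Groups with fewer than six dates are flagged entirely, and ties between equal dates are broken by sort order.

```python
import pandas as pd

# Made-up sample data (an assumption for illustration) with the question's columns.
df = pd.DataFrame({
    "CompanyName": ["A"] * 8 + ["B"] * 3,
    "MetricValue": [float(i) for i in range(11)],
    "AsOfDate": pd.to_datetime([
        "2001-03-31", "2001-04-30", "2001-05-31", "2001-06-30",
        "2001-07-31", "2001-08-31", "2001-09-30", "2001-10-31",
        "2002-03-31", "2002-01-31", "2002-02-28",
    ]),
    "FiscalYear": [2000] * 8 + [2001] * 3,
})

# Sort by date once, then number rows 0, 1, 2, ... inside each
# (CompanyName, FiscalYear) group; position 0 is the earliest AsOfDate.
position = (
    df.sort_values("AsOfDate")
      .groupby(["CompanyName", "FiscalYear"])
      .cumcount()
)

# Assignment aligns on the index, so the original row order is preserved.
# Positions 0..5 are the six earliest dates of each group.
df["cleanse_filter"] = position < 6

print(df)
```

Because the sort and the group numbering each run once over the whole frame, this avoids rebuilding a boolean mask over all 7.7 million rows for every company and year, which is where the loop version spends its time.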