
Time: 2014-04-30 14:21:37

Tags: python pandas



Table A: Columns = [Date, Reading]
Table B: Columns = [StartDate, EndDate, Amount]


Select Date, Reading, Amount From
    A Join B on B.StartDate <= A.Date and B.EndDate > A.Date


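def find(d):
    # Amounts of all ranges in B that contain this row's date (EndDate exclusive)
    r = B[(B['StartDate'] <= d['Date']) & (B['EndDate'] > d['Date'])]['Amount']
    if r.count() > 0:
        return r.iloc[0]  # the first matching Amount value, not its index label
    return 0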

A['Amount'] = A.apply(find, axis=1)
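
The result the row-wise lookup above seems to be aiming for can be pinned down with a plain-Python reference. This is only a sketch under assumptions: the name `range_join` and the sample rows are hypothetical, the first matching range wins, and dates outside every range get 0 (unlike a strict SQL inner join, which would drop them).

```python
from datetime import date

def range_join(a_rows, b_rows, default=0):
    """Attach to each (day, reading) the amount of the first
    (start, end, amount) range with start <= day < end."""
    joined = []
    for day, reading in a_rows:
        amount = default
        for start, end, amt in b_rows:
            if start <= day < end:  # half-open: EndDate itself is excluded
                amount = amt
                break
        joined.append((day, reading, amount))
    return joined

# Hypothetical sample rows standing in for tables A and B
a_rows = [
    (date(2000, 1, 1), 1.5),
    (date(2000, 1, 16), 2.5),  # equals an EndDate, so it matches nothing
    (date(2000, 2, 3), 3.5),
]
b_rows = [
    (date(2000, 1, 1), date(2000, 1, 16), 10),
    (date(2000, 2, 1), date(2000, 2, 16), 20),
]

result = range_join(a_rows, b_rows)
```

A nested loop like this is slow, but it can be useful for checking a faster implementation against a small dataset.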



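1 answer:

Answer 0 (score: 0)

On a dummy dataset with 40K rows in A and 100 rows in B, the following approach is about 1000 times faster than using .apply with the function provided in the question.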


>>> B
   StartDate    EndDate  Amount
0 2000-01-01 2000-01-16      10
1 2000-02-01 2000-02-16      20
2 2000-03-01 2000-03-16      30
3 2000-04-01 2000-04-16      40
4 2000-05-01 2000-05-16      50
         ...        ...     ...

[100 rows x 3 columns]

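B can be converted into a TimeSeries, as shown below, whose index contains every day covered by B's ranges and whose values are the corresponding 'Amount'.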

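# closed='left' keeps StartDate and drops EndDate (half-open range).
# In pandas >= 1.4 this argument is spelled inclusive='left'; closed was removed in 2.0.
def make_series(start, end, amount):
    idx = pd.date_range(start, end, freq='D', closed='left')
    return pd.Series([amount] * len(idx), index=idx)

# Same as make_series, but takes a whole row of B (for use with B.apply)
def make_series2(s):
    idx = pd.date_range(s['StartDate'], s['EndDate'], freq='D', closed='left')
    return pd.Series([s['Amount']] * len(idx), index=idx)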

# for non-overlapping ranges
>>> B2 = pd.concat([make_series(s, e, a) for _, s, e, a in B.itertuples()])

# for overlapping ranges
>>> B2 = B.apply(make_series2, axis=1).bfill().T[0]

>>> timeit B2 = pd.concat([make_series(s, e, a) for _, s, e, a in B.itertuples()])
100 loops, best of 3: 15.9 ms per loop
>>> timeit B2 = B.apply(make_series2, axis=1).bfill().T[0]
10 loops, best of 3: 54.9 ms per loop

>>> B2
2000-01-01      10
2000-01-02      10
...
2008-04-14    1000
2008-04-15    1000
Length: 1500


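# Dates in A that fall outside every range come back as NaN and are set to 0.
# (Recent pandas versions raise KeyError for missing labels; B2.reindex(A.Date) is the equivalent there.)
>>> A['Amount'] = B2[A.Date].fillna(0).values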

>>> timeit A['Amount'] = B2[A.Date].fillna(0).values
1000 loops, best of 3: 1.91 ms per loop


>>> timeit A.apply(find, axis=1)
1 loops, best of 3: 26.3 s per loop
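
For readers on a recent pandas, the same per-day lookup idea can be sketched as follows. This is a minimal illustration, not the original benchmark: it assumes pandas 1.4 or later (for `inclusive='left'`), non-overlapping ranges in B, and hypothetical miniature versions of A and B.

```python
import pandas as pd

# Hypothetical miniature versions of A and B
A = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2000-01-01", "2000-01-15", "2000-01-16", "2000-02-03"]
    ),
    "Reading": [1.0, 2.0, 3.0, 4.0],
})
B = pd.DataFrame({
    "StartDate": pd.to_datetime(["2000-01-01", "2000-02-01"]),
    "EndDate": pd.to_datetime(["2000-01-16", "2000-02-16"]),
    "Amount": [10, 20],
})

def make_series(start, end, amount):
    # inclusive="left" keeps StartDate and drops EndDate (half-open range)
    idx = pd.date_range(start, end, freq="D", inclusive="left")
    return pd.Series(amount, index=idx)

# One entry per covered day; the index must be unique for reindex to work,
# so this form assumes the ranges in B do not overlap.
B2 = pd.concat(
    [make_series(s, e, a) for s, e, a in B.itertuples(index=False)]
)

# reindex returns NaN for dates not covered by any range instead of raising
A["Amount"] = B2.reindex(A["Date"]).fillna(0).to_numpy()
```

Because `reindex` introduces NaN for uncovered dates, the resulting `Amount` column is floating point; cast it back with `astype(int)` if integer amounts are needed.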