I have a question about row selection in pandas. Consider the following example:
import datetime as DT
import pandas as pd

df = pd.DataFrame({
    'Branch': 'A A A A A B'.split(),
    'Buyer': 'Carl Mark Carl Joe Mark Carl'.split(),
    'Quantity': [1, 3, 5, 8, 9, 3],
    'Date': [
        DT.datetime(2013, 9, 1, 13, 0),
        DT.datetime(2013, 9, 1, 13, 5),
        DT.datetime(2013, 10, 1, 20, 0),
        DT.datetime(2013, 10, 3, 10, 0),
        DT.datetime(2013, 12, 2, 12, 0),
        DT.datetime(2013, 12, 2, 14, 0),
    ]})
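I would like to efficiently find the days on which both "Carl" and "Mark" bought something, together with the corresponding purchase times. For example, like this: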
                        Date_1 Buyer_1                Date Buyer
Day
2013-09-01 2013-09-01 13:00:00    Carl 2013-09-01 13:05:00  Mark
2013-12-02 2013-12-02 14:00:00    Carl 2013-12-02 12:00:00  Mark
To do this, I am currently using the following code:
df['Day'] = df.Date.map(lambda t: t.date())
df = df.set_index('Day')
day1 = df[df.Buyer == 'Carl'][['Date', 'Buyer']]
day2 = df[df.Buyer == 'Mark'][['Date', 'Buyer']]
test1 = day1.join(day2, lsuffix='_1')
test1 = test1.dropna()
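However, this code does not perform well: timeit.timeit(mytest, number=1000) takes about 4 s.
Does anyone know how to improve the performance of this computation while keeping it readable?
I would appreciate any help.
Andy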
Answer 0 (score: 1)
Try this:
In [69]: df[df['Buyer'].isin(['Carl', 'Mark'])].set_index('Buyer', append=True)[['Date']].unstack(['Buyer'])
Out[69]:
                          Date
Buyer                     Carl                Mark
Day
2013-09-01 2013-09-01 13:00:00 2013-09-01 13:05:00
2013-10-01 2013-10-01 20:00:00                 NaT
2013-12-02 2013-12-02 14:00:00 2013-12-02 12:00:00
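Note that the output above still contains 2013-10-01, a day on which only Carl bought something. Here is a self-contained sketch of the same unstack idea that also drops such days. The names wide and both, and the use of dt.normalize() to build the Day index, are my own choices and not part of the answer. It also assumes each buyer purchases at most once per day, because unstack raises an error on duplicate (Day, Buyer) pairs.

```python
import datetime as DT
import pandas as pd

df = pd.DataFrame({
    'Buyer': 'Carl Mark Carl Joe Mark Carl'.split(),
    'Date': [
        DT.datetime(2013, 9, 1, 13, 0),
        DT.datetime(2013, 9, 1, 13, 5),
        DT.datetime(2013, 10, 1, 20, 0),
        DT.datetime(2013, 10, 3, 10, 0),
        DT.datetime(2013, 12, 2, 12, 0),
        DT.datetime(2013, 12, 2, 14, 0),
    ],
})

# Assumption: truncate each timestamp to midnight instead of mapping t.date().
df['Day'] = df['Date'].dt.normalize()
df = df.set_index('Day')

# One row per day, one column per buyer, purchase time as the value.
wide = (
    df[df['Buyer'].isin(['Carl', 'Mark'])]
    .set_index('Buyer', append=True)['Date']
    .unstack('Buyer')
)

# Keep only the days on which both buyers have a purchase time.
both = wide.dropna()
print(both)
```

If a buyer can purchase several times per day, the rows would need to be aggregated or de-duplicated before the unstack step.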
Answer 1 (score: 1)
If you have not set the index to Day, you can use the groupby filter method (introduced in pandas 0.12):
In [11]: df
Out[11]:
Day Branch Buyer Date Quantity
0 2013-09-01 A Carl 2013-09-01 13:00:00 1
1 2013-09-01 A Mark 2013-09-01 13:05:00 3
2 2013-10-01 A Carl 2013-10-01 20:00:00 5
3 2013-10-03 A Joe 2013-10-03 10:00:00 8
4 2013-12-02 A Mark 2013-12-02 12:00:00 9
5 2013-12-02 B Carl 2013-12-02 14:00:00 3
In [12]: g = df.groupby('Day', as_index=False)
In [13]: df1 = g.filter(lambda row: set(['Carl', 'Mark']).issubset(set(row.Buyer)))
In [14]: df1
Out[14]:
Day Branch Buyer Date Quantity
0 2013-09-01 A Carl 2013-09-01 13:00:00 1
1 2013-09-01 A Mark 2013-09-01 13:05:00 3
4 2013-12-02 A Mark 2013-12-02 12:00:00 9
5 2013-12-02 B Carl 2013-12-02 14:00:00 3
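For anyone who wants to run the filter step on its own, here is a minimal sketch. The function passed to filter receives the whole sub-frame for each day and must return a single boolean. The name wanted and the way the Day column is built are assumptions of mine, not taken from the session above.

```python
import pandas as pd

df = pd.DataFrame({
    'Buyer': 'Carl Mark Carl Joe Mark Carl'.split(),
    'Quantity': [1, 3, 5, 8, 9, 3],
    'Date': pd.to_datetime([
        '2013-09-01 13:00', '2013-09-01 13:05', '2013-10-01 20:00',
        '2013-10-03 10:00', '2013-12-02 12:00', '2013-12-02 14:00',
    ]),
})

# Assumption: Day is the timestamp truncated to midnight.
df['Day'] = df['Date'].dt.normalize()

wanted = {'Carl', 'Mark'}

# Keep every row of a day only if all wanted buyers appear on that day.
df1 = df.groupby('Day').filter(lambda grp: wanted.issubset(grp['Buyer']))
print(df1)
```

The rows come back in their original order with their original index labels, so the result can be pivoted or joined afterwards.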
You can then use pivot_table:
In [15]: df1.pivot_table('Quantity', 'Day', 'Buyer')
Out[15]:
Buyer Carl Mark
Day
2013-09-01 1 3
2013-12-02 3 9
In [16]: df1.pivot_table(['Date', 'Quantity'], 'Day', 'Buyer',
aggfunc=lambda t: t.values[0])
Out[16]:
Date Quantity
Buyer Carl Mark Carl Mark
Day
2013-09-01 2013-09-01 13:00:00 2013-09-01 13:05:00 1 3
2013-12-02 2013-12-02 14:00:00 2013-12-02 12:00:00 3 9
On reflection, it may make more sense to pivot first:
In [21]: pv1 = df.pivot_table('Quantity', 'Day', 'Buyer')
In [22]: pv1[pd.notnull(pv1['Mark']) & pd.notnull(pv1['Carl'])]
Out[22]:
Buyer Carl Joe Mark
Day
2013-09-01 1 NaN 3
2013-12-02 3 NaN 9
In [22]: pv2 = df.pivot_table(['Date', 'Quantity'], 'Day', 'Buyer',
aggfunc=lambda t: t.values[0])
In [23]: pv2[pd.notnull(pv2[('Quantity', 'Mark')]) & pd.notnull(pv2[('Quantity', 'Carl')])]
Out[23]:
Date Quantity
Buyer Carl Joe Mark Carl Joe Mark
Day
2013-09-01 2013-09-01T14:00:00.000000000+0100 NaN 2013-09-01T14:05:00.000000000+0100 1 NaN 3
2013-12-02 2013-12-02T14:00:00.000000000+0000 NaN 2013-12-02T12:00:00.000000000+0000 3 NaN 9
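The pivot-first variant can be sketched as a standalone script as follows. It swaps the lambda aggregator for the built-in 'first' aggregation and uses the notna method instead of pd.notnull; both are substitutions of mine. The names pv and both are likewise only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Buyer': 'Carl Mark Carl Joe Mark Carl'.split(),
    'Quantity': [1, 3, 5, 8, 9, 3],
    'Date': pd.to_datetime([
        '2013-09-01 13:00', '2013-09-01 13:05', '2013-10-01 20:00',
        '2013-10-03 10:00', '2013-12-02 12:00', '2013-12-02 14:00',
    ]),
})

# Assumption: Day is the timestamp truncated to midnight.
df['Day'] = df['Date'].dt.normalize()

# Pivot first: one row per day, (value, buyer) pairs as columns.
# 'first' keeps the first purchase if a buyer appears twice on a day.
pv = df.pivot_table(values=['Date', 'Quantity'], index='Day',
                    columns='Buyer', aggfunc='first')

# Then keep only the days on which both Carl and Mark have an entry.
mask = pv[('Date', 'Carl')].notna() & pv[('Date', 'Mark')].notna()
both = pv[mask]
print(both)
```

Pivoting first keeps the Joe column in the result; it can be dropped afterwards if only Carl and Mark are of interest.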