I have a DataFrame with several thousand rows and 20 columns. The dates are the index, and many of the dates repeat. Example df:
Stock Sales Data 1 Data 2
1/1/2012 Apple 120 0.996691907 0.376607328
1/1/2012 Apple 230 0.084699221 0.56433743
1/1/2012 Apple 340 0.141253424 0.319522467
1/1/2012 Berry 230 0.506264018 0.123657902
1/1/2012 Berry 340 0.646633737 0.635841995
1/1/2012 Cat 1250 0.204030887 0.928827628
1/1/2012 Cat 850 0.556935133 0.81033956
1/1/2012 Cat 650 0.771751177 0.988848472
1/1/2012 Cat 650 0.615222763 0.468555772
1/2/2012 Apple 1065 0.504410742 0.402553442
1/2/2012 Apple 200 0.752335341 0.487556857
1/2/2012 BlackBerry 1465 0.693017964 0.925737402
1/2/2012 BlackBerry 2000 0.262392424 0.076542936
1/2/2012 BlackBerry 1465 0.851841806 0.345077839
1/2/2012 BlackBerry 1465 0.70635569 0.718340524
1/2/2012 Tomato 700 0.911297224 0.155699549
1/2/2012 Tomato 235 0.118843588 0.662083069
1/2/2012 Carrot 500 0.07255267 0.585773563
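I want to filter the data so that, for each date and each stock, at most 3 rows are shown, and those 3 are the rows with the largest Sales.
If a date-and-stock group has only 1 or 2 rows, all of its rows should simply be kept.
If a date-and-stock group has 3 or more rows, I want only the 3 rows with the largest Sales. If there is a tie for third place (equal Sales figures), I still want at most 3 rows for that date and stock, so whether the tie is broken by random selection or by any other suitable method, the output should still contain exactly 3 rows for that stock on that date.
Example output might look like this: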
Stock Sales Data 1 Data 2
1/1/2012 Apple 120 0.996691907 0.376607328
1/1/2012 Apple 230 0.084699221 0.56433743
1/1/2012 Apple 340 0.141253424 0.319522467
1/1/2012 Berry 230 0.506264018 0.123657902
1/1/2012 Berry 340 0.646633737 0.635841995
1/1/2012 Cat 1250 0.204030887 0.928827628
1/1/2012 Cat 850 0.556935133 0.81033956
1/1/2012 Cat 650 0.771751177 0.988848472
1/2/2012 Apple 1065 0.504410742 0.402553442
1/2/2012 Apple 200 0.752335341 0.487556857
1/2/2012 BlackBerry 2000 0.262392424 0.076542936
1/2/2012 BlackBerry 1465 0.851841806 0.345077839
1/2/2012 BlackBerry 1465 0.70635569 0.718340524
1/2/2012 Tomato 700 0.911297224 0.155699549
1/2/2012 Tomato 235 0.118843588 0.662083069
1/2/2012 Carrot 500 0.07255267 0.585773563
Answer 0 (score: 1)
>>> data.groupby([data.index, data.Stock]).Sales.nlargest(3)
Stock
1/1/2012 Apple 1/1/2012 340
1/1/2012 230
1/1/2012 120
Berry 1/1/2012 340
1/1/2012 230
Cat 1/1/2012 1250
1/1/2012 850
1/1/2012 650
1/2/2012 Apple 1/2/2012 1065
1/2/2012 200
BlackBerry 1/2/2012 2000
1/2/2012 1465
1/2/2012 1465
Carrot 1/2/2012 500
Tomato 1/2/2012 700
1/2/2012 235
Name: Sales, dtype: int64
Of course, if you want the full rows of the DataFrame rather than just the Sales values, we can use iloc.
>>> data.iloc[data.reset_index().groupby(['index', 'Stock'])
.Sales.nlargest(3).index.levels[2]]
Stock Sales Data1 Data2
1/1/2012 Apple 120 0.996692 0.376607
1/1/2012 Apple 230 0.084699 0.564337
1/1/2012 Apple 340 0.141253 0.319522
1/1/2012 Berry 230 0.506264 0.123658
1/1/2012 Berry 340 0.646634 0.635842
1/1/2012 Cat 1250 0.204031 0.928828
1/1/2012 Cat 850 0.556935 0.810340
1/1/2012 Cat 650 0.771751 0.988848
1/2/2012 Apple 1065 0.504411 0.402553
1/2/2012 Apple 200 0.752335 0.487557
1/2/2012 BlackBerry 1465 0.693018 0.925737
1/2/2012 BlackBerry 2000 0.262392 0.076543
1/2/2012 BlackBerry 1465 0.851842 0.345078
1/2/2012 Tomato 700 0.911297 0.155700
1/2/2012 Tomato 235 0.118844 0.662083
1/2/2012 Carrot 500 0.072553 0.585774
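The row-selection step above can also be written by reading the selected row labels with get_level_values instead of levels. The following is a minimal sketch, not the answerer's code: the small `data` frame is a hypothetical stand-in for the question's data, and `flat`, `top`, `positions`, and `result` are illustrative names.

```python
import pandas as pd

# Hypothetical miniature of the question's data; dates are the (repeated) index.
data = pd.DataFrame(
    {
        "Stock": ["Cat", "Cat", "Cat", "Cat", "Berry", "Berry"],
        "Sales": [1250, 850, 650, 650, 230, 340],
    },
    index=["1/1/2012"] * 6,
)

# Move the dates into a column so every row gets a unique label 0..n-1.
flat = data.reset_index()

# Per (date, stock) group, take the 3 largest Sales values. The last index
# level of the result holds the labels of the rows of `flat` that were picked.
top = flat.groupby(["index", "Stock"])["Sales"].nlargest(3)
positions = top.index.get_level_values(-1)

# Sort the positions to restore the original row order, then select by position.
result = data.iloc[sorted(positions)]
print(result)
```

Because `flat` has a default 0..n-1 index, its labels coincide with row positions, which is what makes the final iloc lookup valid. With tied Sales values, nlargest keeps the rows that appear first.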
Answer 1 (score: 0)
Using sort_values(), groupby(), and head() appears to produce the result you are looking for.
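import pandas as pd
# Read the whitespace-separated sample data; the raw string avoids an invalid-escape warning
df = pd.read_table('fruit', sep=r'\s+')
df.Date = pd.to_datetime(df.Date)
# Sort so that, within each Date/Stock group, the largest Sales come first
df.sort_values(by=['Date', 'Stock', 'Sales'],
               ascending=[True, True, False],
               inplace=True)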
# Date Stock Sales Data1 Data2
# 2 2012-01-01 Apple 340 0.141253 0.319522
# 1 2012-01-01 Apple 230 0.084699 0.564337
# 0 2012-01-01 Apple 120 0.996692 0.376607
# 4 2012-01-01 Berry 340 0.646634 0.635842
# 3 2012-01-01 Berry 230 0.506264 0.123658
# 5 2012-01-01 Cat 1250 0.204031 0.928828
# 6 2012-01-01 Cat 850 0.556935 0.810340
# 7 2012-01-01 Cat 650 0.771751 0.988848
# 8 2012-01-01 Cat 650 0.615223 0.468556
# 9 2012-01-02 Apple 1065 0.504411 0.402553
# 10 2012-01-02 Apple 200 0.752335 0.487557
# 12 2012-01-02 BlackBerry 2000 0.262392 0.076543
# 11 2012-01-02 BlackBerry 1465 0.693018 0.925737
# 13 2012-01-02 BlackBerry 1465 0.851842 0.345078
# 14 2012-01-02 BlackBerry 1465 0.706356 0.718341
# 17 2012-01-02 Carrot 500 0.072553 0.585774
# 15 2012-01-02 Tomato 700 0.911297 0.155700
# 16 2012-01-02 Tomato 235 0.118844 0.662083
# head() returns a new DataFrame, so assign the result before printing
df = df.groupby(by=['Date', 'Stock'], as_index=False, sort=False).head(3)
print(df)
# Date Stock Sales Data1 Data2
# 2 2012-01-01 Apple 340 0.141253 0.319522
# 1 2012-01-01 Apple 230 0.084699 0.564337
# 0 2012-01-01 Apple 120 0.996692 0.376607
# 4 2012-01-01 Berry 340 0.646634 0.635842
# 3 2012-01-01 Berry 230 0.506264 0.123658
# 5 2012-01-01 Cat 1250 0.204031 0.928828
# 6 2012-01-01 Cat 850 0.556935 0.810340
# 7 2012-01-01 Cat 650 0.771751 0.988848
# 9 2012-01-02 Apple 1065 0.504411 0.402553
# 10 2012-01-02 Apple 200 0.752335 0.487557
# 12 2012-01-02 BlackBerry 2000 0.262392 0.076543
# 11 2012-01-02 BlackBerry 1465 0.693018 0.925737
# 13 2012-01-02 BlackBerry 1465 0.851842 0.345078
# 17 2012-01-02 Carrot 500 0.072553 0.585774
# 15 2012-01-02 Tomato 700 0.911297 0.155700
# 16 2012-01-02 Tomato 235 0.118844 0.662083
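For anyone without the 'fruit' file, the same sort-then-head idea can be tried on an inline frame. This is a sketch under stated assumptions: the six rows below are a hypothetical sample mirroring part of the question's data, and `ordered` and `top3` are illustrative names.

```python
import pandas as pd

# Hypothetical sample mirroring a few rows of the question's data.
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2012-01-02"] * 6),
        "Stock": ["BlackBerry"] * 4 + ["Tomato"] * 2,
        "Sales": [1465, 2000, 1465, 1465, 700, 235],
    }
)

# Within each (Date, Stock) group, put the largest Sales first.
ordered = df.sort_values(
    by=["Date", "Stock", "Sales"],
    ascending=[True, True, False],
)

# head(3) keeps at most the first three rows of each group, so groups with
# one or two rows pass through unchanged.
top3 = ordered.groupby(["Date", "Stock"], sort=False).head(3)
print(top3)
```

When several rows tie for third place, as the three 1465 rows do here, head(3) simply keeps whichever tied rows come first after sorting, so the group is still capped at three rows. If a random tie-break is preferred, shuffling the rows before sorting is one possible approach.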