Filtering rows after a groupby in pandas

Time: 2017-01-24 06:27:52

Tags: python pandas

I have a table in pandas:

import pandas as pd

df = pd.DataFrame({
    'LeafID':[1,1,2,1,3,3,1,6,3,5,1],
    'pidx':[10,10,300,10,30,40,20,10,30,45,20],
    'pidy':[20,20,400,20,15,20,12,43,54,112,23],
    'count':[10,20,30,40,80,10,20,50,30,10,70],
    'score':[10,10,10,22,22,3,4,5,9,0,1]
})

    LeafID  count  pidx  pidy  score
0        1     10    10    20     10
1        1     20    10    20     10
2        2     30   300   400     10
3        1     40    10    20     22
4        3     80    30    15     22
5        3     10    40    20      3
6        1     20    20    12      4
7        6     50    10    43      5
8        3     30    20    54      9
9        5     10    45   112      0
10       1     70    20    23      1

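I want to group by pidx and then keep only the rows whose pidx value occurs more than 2 times.

That is, keep the rows where pidx is 10 or 20.

I tried df.groupby('pidx').count(), but it did not help. For the rows that remain, I then need to compute 0.4 * count + 0.6 * score.

The desired output is: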

LeafID    count       pidx     pidy    final_score
   1       10           10        20
   1       20           10        20
   1       40           10        20
   6       50           10        43
   1       20           20        12
   3       30           20        54
   1       70           20        23

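4 answers:

Answer 0 (score: 6)

This applies the filter directly after performing the groupby. In the data you provided, the pidx value 20 occurs only twice, so those rows are filtered out.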

df.groupby('pidx').filter(lambda x: len(x) > 2)

   LeafID  count  pidx  pidy
0       1     10    10    20
1       1     20    10    20
3       1     40    10    20
7       6     50    10    43
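The same groupby/filter idea can be sketched with the threshold pulled out as a parameter, which makes it easy to see why pidx 20 disappears at "more than 2" but survives at "more than 1". The helper name `keep_frequent` and its arguments are illustrative assumptions, not part of pandas; the data is the DataFrame from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'LeafID': [1, 1, 2, 1, 3, 3, 1, 6, 3, 5, 1],
    'pidx': [10, 10, 300, 10, 30, 40, 20, 10, 30, 45, 20],
    'pidy': [20, 20, 400, 20, 15, 20, 12, 43, 54, 112, 23],
    'count': [10, 20, 30, 40, 80, 10, 20, 50, 30, 10, 70],
    'score': [10, 10, 10, 22, 22, 3, 4, 5, 9, 0, 1],
})


def keep_frequent(frame, column, min_count):
    """Hypothetical helper: keep rows whose `column` value occurs
    more than `min_count` times in `frame`."""
    # filter() calls the lambda once per group and keeps whole groups
    # for which it returns True.
    return frame.groupby(column).filter(lambda group: len(group) > min_count)


more_than_two = keep_frequent(df, 'pidx', 2)   # only pidx == 10 qualifies
more_than_one = keep_frequent(df, 'pidx', 1)   # pidx 10, 20 and 30 qualify

print(more_than_two)
print(more_than_one)
```

Because the lambda runs in Python once per group, this form is convenient for small frames but can be slow when there are many groups, which is what the timings in the next answer measure.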

Answer 1 (score: 3)

You can use value_counts together with boolean indexing and isin:

df = pd.DataFrame({
    'LeafID':[1,1,2,1,3,3,1,6,3,5,1],
    'pidx':[10,10,300,10,30,40,20,10,30,45,20],
    'pidy':[20,20,400,20,15,20,12,43,54,112,23],
    'count':[10,20,30,40,80,10,20,50,30,10,70],
    'score':[10,10,10,22,22,3,4,5,9,0,1]
})
print (df)
    LeafID  count  pidx  pidy  score
0        1     10    10    20     10
1        1     20    10    20     10
2        2     30   300   400     10
3        1     40    10    20     22
4        3     80    30    15     22
5        3     10    40    20      3
6        1     20    20    12      4
7        6     50    10    43      5
8        3     30    30    54      9
9        5     10    45   112      0
10       1     70    20    23      1

s = df.pidx.value_counts()
idx = s[s>2].index
print (df[df.pidx.isin(idx)])
   LeafID  count  pidx  pidy  score
0       1     10    10    20     10
1       1     20    10    20     10
3       1     40    10    20     22
7       6     50    10    43      5

Timings:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000


L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'LeafId':np.random.randint(1000, size=N),
                   'pidx': np.random.randint(10000, size=N),
                   'pidy': np.random.choice(L2, N),
                   'count':np.random.randint(1000, size=N)})
print (df)


print (df.groupby('pidx').filter(lambda x: len(x) > 120))

def jez(df):
    s = df.pidx.value_counts()
    return df[df.pidx.isin(s[s>120].index)]

print (jez(df))

In [55]: %timeit (df.groupby('pidx').filter(lambda x: len(x) > 120))
1 loop, best of 3: 1.17 s per loop

In [56]: %timeit (jez(df))
10 loops, best of 3: 141 ms per loop

In [62]: %timeit (df[df.groupby('pidx').pidx.transform('size') > 120])
10 loops, best of 3: 102 ms per loop

In [63]: %timeit (df[df.groupby('pidx').pidx.transform(len) > 120])
1 loop, best of 3: 685 ms per loop

In [64]: %timeit (df[df.groupby('pidx').pidx.transform('count') > 120])
10 loops, best of 3: 104 ms per loop
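The transform-based variants appear above only inside the timings, so here is a small sketch of what `transform('size')` does on the question's data. It assumes the same DataFrame as the question and simply checks that the resulting mask selects the same rows as the value_counts/isin approach.

```python
import pandas as pd

df = pd.DataFrame({
    'LeafID': [1, 1, 2, 1, 3, 3, 1, 6, 3, 5, 1],
    'pidx': [10, 10, 300, 10, 30, 40, 20, 10, 30, 45, 20],
    'pidy': [20, 20, 400, 20, 15, 20, 12, 43, 54, 112, 23],
    'count': [10, 20, 30, 40, 80, 10, 20, 50, 30, 10, 70],
    'score': [10, 10, 10, 22, 22, 3, 4, 5, 9, 0, 1],
})

# transform('size') broadcasts each group's row count back onto every row,
# so the result is aligned with df and can be used directly as a mask.
group_size = df.groupby('pidx')['pidx'].transform('size')
by_transform = df[group_size > 2]

# The value_counts/isin variant from this answer, for comparison.
counts = df['pidx'].value_counts()
by_isin = df[df['pidx'].isin(counts[counts > 2].index)]

print(group_size.tolist())
print(by_transform.equals(by_isin))
```

Both variants avoid calling a Python function per group, which is a plausible reason they come out faster than `filter` with a lambda in the measurements above.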

Answer 2 (score: 1)


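Answer 3 (score: 0)

First, your desired output shows that you do not actually want to group anything. Read up on what groupby does. What you need is: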

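# Take an explicit copy so that adding a column later does not trigger SettingWithCopyWarning
df2 = df[df['pidx'] <= 20].copy()
# sort_index(by=...) has been removed from pandas; sort_values is the replacement
df2 = df2.sort_values(by='pidx')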

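This gives you exactly the result you asked for. Read up on pandas indexing and functions. In fact, go and read the whole introduction to pandas; it will not take long.

Row-wise operations are also easy to write using indexing: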

df2['final_score']= 0.4*df2['count'] + 0.6*df2['score']
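For completeness, here is a sketch of the weighted score applied to rows chosen by how often their pidx occurs, as the question describes, rather than by a fixed cut-off on the pidx value. It assumes the DataFrame from the question; the variable names `selected` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'LeafID': [1, 1, 2, 1, 3, 3, 1, 6, 3, 5, 1],
    'pidx': [10, 10, 300, 10, 30, 40, 20, 10, 30, 45, 20],
    'pidy': [20, 20, 400, 20, 15, 20, 12, 43, 54, 112, 23],
    'count': [10, 20, 30, 40, 80, 10, 20, 50, 30, 10, 70],
    'score': [10, 10, 10, 22, 22, 3, 4, 5, 9, 0, 1],
})

# Keep rows whose pidx value occurs more than twice.
counts = df['pidx'].value_counts()
selected = df[df['pidx'].isin(counts[counts > 2].index)]

# assign() returns a new frame, so the slice is never modified in place.
result = (
    selected
    .assign(final_score=0.4 * selected['count'] + 0.6 * selected['score'])
    .sort_values(by='pidx')
)

print(result[['LeafID', 'count', 'pidx', 'pidy', 'final_score']])
```

Using `assign` instead of `df2['final_score'] = ...` is a design choice to sidestep the chained-assignment warning on a filtered frame; the arithmetic is the same formula from the question.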