I have a DataFrame in pandas:
import pandas as pd
df = pd.DataFrame({
'LeafID':[1,1,2,1,3,3,1,6,3,5,1],
'pidx':[10,10,300,10,30,40,20,10,30,45,20],
'pidy':[20,20,400,20,15,20,12,43,54,112,23],
'count':[10,20,30,40,80,10,20,50,30,10,70],
'score':[10,10,10,22,22,3,4,5,9,0,1]
})
LeafID count pidx pidy score
0 1 10 10 20 10
1 1 20 10 20 10
2 2 30 300 400 10
3 1 40 10 20 22
4 3 80 30 15 22
5 3 10 40 20 3
6 1 20 20 12 4
7 6 50 10 43 5
8 3 30 20 54 9
9 5 10 45 112 0
10 1 70 20 23 1
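I want to group by pidx and then keep only the rows whose pidx value occurs more than 2 times, i.e. the rows where pidx is 10 or 20.
I tried df.groupby('pidx').count(), but it did not help. For those rows, I then have to compute 0.4 * count + 0.6 * score.
The desired output is: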
LeafID count pidx pidy final_score
1 10 10 20
1 20 10 20
1 40 10 20
6 50 10 43
1 20 20 12
3 30 20 54
1 70 20 23
Answer 0 (score: 6)
This applies the filter directly after the groupby. In the data you provided, the pidx value 20 occurs only twice, so it is filtered out.
df.groupby('pidx').filter(lambda x: len(x) > 2)
LeafID count pidx pidy
0 1 10 10 20
1 1 20 10 20
3 1 40 10 20
7 6 50 10 43
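The filter above does not yet compute the weighted score the question asks for. A minimal sketch of one way to add it, assuming the column name final_score from the desired output and using assign (the names kept and result are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'LeafID': [1, 1, 2, 1, 3, 3, 1, 6, 3, 5, 1],
    'pidx':   [10, 10, 300, 10, 30, 40, 20, 10, 30, 45, 20],
    'pidy':   [20, 20, 400, 20, 15, 20, 12, 43, 54, 112, 23],
    'count':  [10, 20, 30, 40, 80, 10, 20, 50, 30, 10, 70],
    'score':  [10, 10, 10, 22, 22, 3, 4, 5, 9, 0, 1],
})

# Keep only the rows whose pidx value occurs more than twice.
kept = df.groupby('pidx').filter(lambda g: len(g) > 2)

# Weighted score from the question: 0.4 * count + 0.6 * score.
# assign() returns a new DataFrame, so neither df nor kept is modified.
result = kept.assign(final_score=0.4 * kept['count'] + 0.6 * kept['score'])
print(result)
```

Note that 'count' has to be accessed with brackets, because df.count is a DataFrame method.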
Answer 1 (score: 3)
You can use value_counts with boolean indexing and isin:
df = pd.DataFrame({
'LeafID':[1,1,2,1,3,3,1,6,3,5,1],
'pidx':[10,10,300,10,30,40,20,10,30,45,20],
'pidy':[20,20,400,20,15,20,12,43,54,112,23],
'count':[10,20,30,40,80,10,20,50,30,10,70],
'score':[10,10,10,22,22,3,4,5,9,0,1]
})
print (df)
LeafID count pidx pidy score
0 1 10 10 20 10
1 1 20 10 20 10
2 2 30 300 400 10
3 1 40 10 20 22
4 3 80 30 15 22
5 3 10 40 20 3
6 1 20 20 12 4
7 6 50 10 43 5
8 3 30 30 54 9
9 5 10 45 112 0
10 1 70 20 23 1
s = df.pidx.value_counts()
idx = s[s>2].index
print (df[df.pidx.isin(idx)])
LeafID count pidx pidy score
0 1 10 10 20 10
1 1 20 10 20 10
3 1 40 10 20 22
7 6 50 10 43 5
Timings:
import numpy as np
np.random.seed(123)
N = 1000000
L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'LeafId':np.random.randint(1000, size=N),
'pidx': np.random.randint(10000, size=N),
'pidy': np.random.choice(L2, N),
'count':np.random.randint(1000, size=N)})
print (df)
print (df.groupby('pidx').filter(lambda x: len(x) > 120))
def jez(df):
s = df.pidx.value_counts()
return df[df.pidx.isin(s[s>120].index)]
print (jez(df))
In [55]: %timeit (df.groupby('pidx').filter(lambda x: len(x) > 120))
1 loop, best of 3: 1.17 s per loop
In [56]: %timeit (jez(df))
10 loops, best of 3: 141 ms per loop
In [62]: %timeit (df[df.groupby('pidx').pidx.transform('size') > 120])
10 loops, best of 3: 102 ms per loop
In [63]: %timeit (df[df.groupby('pidx').pidx.transform(len) > 120])
1 loop, best of 3: 685 ms per loop
In [64]: %timeit (df[df.groupby('pidx').pidx.transform('count') > 120])
10 loops, best of 3: 104 ms per loop
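The timings include a transform('size') variant that is not shown on the small example. A sketch of how it might look there, assuming the same threshold of 2 as in the question (the names counts, by_transform and by_filter are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'LeafID': [1, 1, 2, 1, 3, 3, 1, 6, 3, 5, 1],
    'pidx':   [10, 10, 300, 10, 30, 40, 20, 10, 30, 45, 20],
    'pidy':   [20, 20, 400, 20, 15, 20, 12, 43, 54, 112, 23],
    'count':  [10, 20, 30, 40, 80, 10, 20, 50, 30, 10, 70],
    'score':  [10, 10, 10, 22, 22, 3, 4, 5, 9, 0, 1],
})

# transform('size') broadcasts each group's size back onto its rows,
# giving a Series aligned with df that can be used as a boolean mask.
counts = df.groupby('pidx')['pidx'].transform('size')
by_transform = df[counts > 2]

# The groupby/filter version, for comparison on the same data.
by_filter = df.groupby('pidx').filter(lambda g: len(g) > 2)

print(by_transform)
```

Both variants select the same rows here; the mask version avoids calling a Python lambda once per group, which is a plausible reason for the difference in the timings above.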
Answer 2 (score: 1)
Answer 3 (score: 0)
First, your desired output shows that you do not actually want a groupby. Read up on what groupby does. What you need is:
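df2 = df[df['pidx'] <= 20].copy()  # copy, so adding a column later does not raise SettingWithCopyWarning
df2 = df2.sort_values(by='pidx')   # sort_index(by=...) is no longer available in current pandas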
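This will give you the desired result. Read up on pandas indexing and functions; in fact, read the whole introduction to pandas, it does not take long.
Writing row-wise operations using indexing is also simple: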
df2['final_score']= 0.4*df2['count'] + 0.6*df2['score']
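As a variation on the indexing idea, the score can also be written into the original DataFrame for the selected rows only, using a boolean mask with .loc. This is only a sketch: the name mask is illustrative, and the threshold of 20 is the hard-coded value from this answer rather than a frequency-based selection.

```python
import pandas as pd

df = pd.DataFrame({
    'LeafID': [1, 1, 2, 1, 3, 3, 1, 6, 3, 5, 1],
    'pidx':   [10, 10, 300, 10, 30, 40, 20, 10, 30, 45, 20],
    'pidy':   [20, 20, 400, 20, 15, 20, 12, 43, 54, 112, 23],
    'count':  [10, 20, 30, 40, 80, 10, 20, 50, 30, 10, 70],
    'score':  [10, 10, 10, 22, 22, 3, 4, 5, 9, 0, 1],
})

# Boolean mask selecting the rows with pidx <= 20 (hard-coded threshold).
mask = df['pidx'] <= 20

# Assign the weighted score only where the mask is True;
# rows outside the mask get NaN in the new column.
df.loc[mask, 'final_score'] = (
    0.4 * df.loc[mask, 'count'] + 0.6 * df.loc[mask, 'score']
)
print(df)
```

Writing through .loc on the original DataFrame avoids the chained-assignment pitfalls that can arise when assigning into a slice.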