I have a DataFrame df that looks like this:
Rank User
0 1690 samberman1212
1 1690 khogan3131
2 1690 narguero
3 1690 Awesemo
4 1690 Awesemo
5 1690 cptnspaulding
6 1690 Fluke7634
7 1690 giantsquid
8 1690 vidthekid22
9 1690 I_Slewfoot_U
10 1690 Mirage88
11 1690 Mirage88
12 1690 Mirage88
13 1690 Testosterown
14 1715 Anonymous
15 1715 Anonymous
I want to use the number of times each user appears in df.
So I created a new DataFrame, multiple:
multiple=users_df.groupby("User").count()
print(multiple)
Rank
User
Anonymous 2
Awesemo 2
Fluke7634 1
I_Slewfoot_U 1
Mirage88 3
Testosterown 1
cptnspaulding 1
giantsquid 1
khogan3131 1
narguero 1
samberman1212 1
vidthekid22 1
Finally, I want to extract the frequency count and assign it to the variable count_IP:
for i, row in users_df.iterrows():
    rank = row['Rank']
    user = row['User']
    count_IP = ???
So I would like to look up, in multiple, the count for the user (user) of the current loop iteration. I tried:
multiple.query('User==user')['Rank']
#AND
multiple[multiple["User"]==user]["Rank"]
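However, neither works. It seems that User, the column I grouped by, cannot be accessed as a column, because when I ask for the column names I get: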
list(multiple.columns.values)
['Rank']
How can I solve this?
UPDATE:
What I actually want to obtain is not the frequency count but the order of appearance:
Rank User order of appearance
0 1690 samberman1212 1
1 1690 khogan3131 1
2 1690 narguero 1
3 1690 Awesemo 1
4 1690 Awesemo 2
5 1690 cptnspaulding 1
6 1690 Fluke7634 1
7 1690 giantsquid 1
8 1690 vidthekid22 1
9 1690 I_Slewfoot_U 1
10 1690 Mirage88 1
11 1690 Mirage88 2
12 1690 Mirage88 3
13 1690 Testosterown 1
14 1715 Anonymous 1
15 1715 Anonymous 2
UPDATE#2:
I am trying this on a DataFrame that contains more Anonymous entries:
Rank User order of appearence
0 1 boggslite 1
1 2 dokcash 1
2 3 loumister35 1
3 4 drhass 1
4 5 onem4nwolfpack 1
5 6 felder15 1
6 7 TwoStix 1
7 8 Mwise120 1
8 9 sdchickens 1
9 10 tastefultides 1
10 11 bric75 1
11 12 ycmmat 1
12 13 tastefultides 1
13 14 mpgoldberg16 1
14 14 mpgoldberg16 2
15 16 Cicima6709 1
16 17 LSUTom123 1
17 18 bunglerprime 1
18 18 Testosterown 1
19 20 dfsteams 1
20 20 yankeesfan2 1
21 22 tfinnerty 1
22 23 bellmar21 1
23 24 Awesemo 1
24 25 shocky26 1
25 25 tastefultides 1
26 27 Thanks4DaChedda 1
27 28 isupol 1
28 28 jwestphal708 1
29 30 giantsquid 1
30 31 boggslite 1
31 32 Thanks4DaChedda 1
32 33 dre87 1
33 33 BlarneyBoys 1
34 33 bric75 1
35 36 ezellmt 1
36 36 Cicima6709 1
37 38 ivanage 1
38 38 Thanks4DaChedda 1
39 40 nevs2904 1
40 41 gridironguru999 1
41 42 Anonymous 1
42 43 Anonymous 1
43 44 Anonymous 1
44 45 Anonymous 1
45 45 Anonymous 2
46 47 Anonymous 1
47 48 Anonymous 1
48 49 Anonymous 1
49 50 Anonymous 1
Answer 0 (score: 2)
IIUC, you want to use df.groupby() together with transform():
# Select a single column before transform so the result is a Series, not a DataFrame
df['count_IP'] = df.groupby('User')['User'].transform('count')
print(df)
Rank User count_IP
0 1690 samberman1212 1
1 1690 khogan3131 1
2 1690 narguero 1
3 1690 Awesemo 2
4 1690 Awesemo 2
5 1690 cptnspaulding 1
6 1690 Fluke7634 1
7 1690 giantsquid 1
8 1690 vidthekid22 1
9 1690 I_Slewfoot_U 1
10 1690 Mirage88 3
11 1690 Mirage88 3
12 1690 Mirage88 3
13 1690 Testosterown 1
14 1715 Anonymous 2
15 1715 Anonymous 2
If you want to remove the duplicated rows, you can then run df=df.drop_duplicates().
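As for why the lookups in the question failed: after groupby("User").count(), User is the index of multiple rather than a column, which is why only 'Rank' shows up in the column list. Below is a minimal sketch of three ways to read the count, assuming a small sample frame that mirrors the question's data; the variable names count_for_awesemo, flat and same_count are made up for illustration.

```python
import pandas as pd

# Small sample mirroring the question's data (illustrative subset of rows)
users_df = pd.DataFrame({
    "Rank": [1690, 1690, 1690, 1690, 1715, 1715],
    "User": ["Awesemo", "Awesemo", "Mirage88", "narguero", "Anonymous", "Anonymous"],
})

multiple = users_df.groupby("User").count()

# "User" is the index of `multiple`, not a column, so look it up by label
count_for_awesemo = multiple.loc["Awesemo", "Rank"]

# Alternative: move the index back into a column, then filter as in the question
flat = multiple.reset_index()
same_count = flat.loc[flat["User"] == "Awesemo", "Rank"].iloc[0]

# Vectorised alternative to the iterrows loop: map each user to its count
users_df["count_IP"] = users_df["User"].map(multiple["Rank"])
print(users_df)
```

Inside the original loop, multiple.loc[user, 'Rank'] would fill in count_IP, although the map version avoids the loop altogether.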
EDIT, to show the order of appearance:
df['order of appearence']=df.groupby('User')['User'].transform(lambda x : x.duplicated().cumsum().add(1))
print(df)
Rank User order of appearence
0 1690 samberman1212 1
1 1690 khogan3131 1
2 1690 narguero 1
3 1690 Awesemo 1
4 1690 Awesemo 2
5 1690 cptnspaulding 1
6 1690 Fluke7634 1
7 1690 giantsquid 1
8 1690 vidthekid22 1
9 1690 I_Slewfoot_U 1
10 1690 Mirage88 1
11 1690 Mirage88 2
12 1690 Mirage88 3
13 1690 Testosterown 1
14 1715 Anonymous 1
15 1715 Anonymous 2
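A shorter alternative for the order of appearance is groupby(...).cumcount(), which numbers the rows within each group starting from 0. This is only a sketch on an illustrative sample frame, not the approach used above; the column name "order of appearance" is chosen here to match the question.

```python
import pandas as pd

# Illustrative sample; note that "Awesemo" repeats on non-adjacent rows
df = pd.DataFrame({
    "Rank": [1690, 1690, 1690, 1690, 1690, 1690],
    "User": ["Awesemo", "narguero", "Awesemo", "Mirage88", "Mirage88", "Mirage88"],
})

# cumcount() numbers the rows of each group 0, 1, 2, ...; add 1 for a 1-based order
df["order of appearance"] = df.groupby("User").cumcount() + 1
print(df)
```

Because the grouping is on User only, every earlier occurrence of the same user is counted regardless of its Rank.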