I have some data. You can see it below.
user_id item_id cate_id action_type action_time
0 11482147 492681 1_11 view 1487174400
1 12070750 457406 1_14 deep_view 1487174400
2 12431632 527476 1_1 view 1487174400
3 13397746 531771 1_6 deep_view 1487174400
4 13794253 510089 1_27 deep_view 1487174400
5 14378544 535335 1_6 deep_view 1487174400
6 1705634 535202 1_10 view 1487174400
7 6943823 478183 1_3 deep_view 1487174400
8 5902475 524378 1_6 view 1487174401
Then I wrote this code: print(w.groupby('user_id').size())
But the result is not what I wanted. You can see it below.
077F63F3-3DF4-4041-B3C9-7BAB2BDCA795 67
08f6ea6d2181b902d8cbeccdccf61efc 34
095A18FB-2C8E-4C00-8F2D-B481CB674ECE 4
096F9140-F748-4DE3-A4C3-EBAAA277144D 64
0B9DDF98-12A0-45DF-9CF7-F4194BF23282 64
0F3D4D6F-A906-4396-BA3B-1E69B0F6867C 8
10000484 88
10000886 105
10000953 51
10000956 41
10001967 165
Why does this happen?
Answer 0 (score: 0)
If you need to convert the index to a column, call reset_index:
df = w.groupby('user_id').size().reset_index(name='count')
print (df)
user_id count
0 1705634 1
1 5902475 1
2 6943823 1
3 11482147 1
4 12070750 1
5 12431632 1
6 13397746 1
7 13794253 1
8 14378544 1
If you call groupby + size, the output is a Series, so the first column shown is the index:
s = w.groupby('user_id').size()
print (s)
user_id
1705634 1
5902475 1
6943823 1
11482147 1
12070750 1
12431632 1
13397746 1
13794253 1
14378544 1
dtype: int64
print (type(s))
<class 'pandas.core.series.Series'>
print (s.index)
Int64Index([ 1705634, 5902475, 6943823, 11482147, 12070750, 12431632,
13397746, 13794253, 14378544],
dtype='int64', name='user_id')
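To make the Series vs. DataFrame difference concrete, here is a minimal sketch. The small frame below is made-up sample data that only mirrors the question's column names (one user_id is repeated on purpose so the counts are not all 1); it is not the asker's real dataset.

```python
import pandas as pd

# Hypothetical sample data mirroring the question's columns;
# user_id 11482147 appears twice so that one count is greater than 1.
w = pd.DataFrame({
    "user_id": [11482147, 12070750, 11482147, 1705634],
    "action_type": ["view", "deep_view", "view", "view"],
})

# groupby + size returns a Series: the group keys become the index.
s = w.groupby("user_id").size()

# reset_index moves the keys back into a regular column and names the counts.
df = s.reset_index(name="count")

print(type(s).__name__, s.index.name)
print(df)
```

So what looks like a "first column" in the printed Series is really the index; reset_index is only needed if you want user_id back as an ordinary column.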
But if you think the values are wrong, you can also sort the DataFrame by the user_id column with sort_values and check, because groupby sorts the group keys by default:
print (w.sort_values(['user_id']))
user_id item_id cate_id action_type action_time
6 1705634 535202 1_10 view 1487174400
8 5902475 524378 1_6 view 1487174401
7 6943823 478183 1_3 deep_view 1487174400
0 11482147 492681 1_11 view 1487174400
1 12070750 457406 1_14 deep_view 1487174400
2 12431632 527476 1_1 view 1487174400
3 13397746 531771 1_6 deep_view 1487174400
4 13794253 510089 1_27 deep_view 1487174400
5 14378544 535335 1_6 deep_view 1487174400
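As a quick check of the default sorting, the sketch below compares groupby with and without the sort parameter. The data is invented for illustration; only the user_id column name is taken from the question.

```python
import pandas as pd

# Hypothetical sample data; the rows are deliberately not in sorted order.
w = pd.DataFrame({"user_id": [11482147, 1705634, 11482147, 5902475]})

# sort=True is the default, so the group keys come back in ascending order.
sorted_keys = list(w.groupby("user_id").size().index)

# sort=False keeps the keys in order of first appearance instead.
appearance_keys = list(w.groupby("user_id", sort=False).size().index)

print(sorted_keys)
print(appearance_keys)
```

If the output order differs from the order of the original rows, that alone does not mean the counts are wrong; it is just the keys being sorted.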
EDIT:
To remove rows with non-numeric values, use:
#change first and last value to non numeric
print (w['user_id'])
0 0F3D4D6F-A906-4396-BA3B-1E69B0F6867C
1 12070750
2 12431632
3 13397746
4 13794253
5 14378544
6 1705634
7 6943823
8 077F63F3-3DF4-4041-B3C9-7BAB2BDCA795
Name: user_id, dtype: object
w = w[pd.to_numeric(w['user_id'], errors='coerce').notnull()]
print (w)
user_id item_id cate_id action_type action_time
1 12070750 457406 1_14 deep_view 1487174400
2 12431632 527476 1_1 view 1487174400
3 13397746 531771 1_6 deep_view 1487174400
4 13794253 510089 1_27 deep_view 1487174400
5 14378544 535335 1_6 deep_view 1487174400
6 1705634 535202 1_10 view 1487174400
7 6943823 478183 1_3 deep_view 1487174400
Explanation:
Use boolean indexing to filter by a mask built with to_numeric (the parameter errors='coerce' replaces non-numeric values with NaN) followed by notnull:
print (pd.to_numeric(w['user_id'], errors='coerce'))
0 NaN
1 12070750.0
2 12431632.0
3 13397746.0
4 13794253.0
5 14378544.0
6 1705634.0
7 6943823.0
8 NaN
Name: user_id, dtype: float64
print (pd.to_numeric(w['user_id'], errors='coerce').notnull())
0 False
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 False
Name: user_id, dtype: boo
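
Putting the steps above together, here is a self-contained sketch, assuming user_id was read as strings (object dtype) with a few UUID-like values mixed in. The frame is made-up sample data, and the final conversion to int64 is an optional extra step that is not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data: user_id holds strings, one of them non-numeric.
w = pd.DataFrame({
    "user_id": ["0F3D4D6F-A906-4396-BA3B-1E69B0F6867C",
                "12070750", "1705634", "12070750"],
    "item_id": [492681, 457406, 535202, 478183],
})

# Non-numeric strings become NaN; notnull() turns that into a boolean mask.
numeric_ids = pd.to_numeric(w["user_id"], errors="coerce")
mask = numeric_ids.notnull()

rejected = w[~mask]          # rows that would be dropped, handy for inspection
clean = w[mask].copy()       # rows with a numeric user_id

# Optional: store the ids as integers so they sort numerically, not as text.
clean["user_id"] = numeric_ids[mask].astype("int64")

counts = clean.groupby("user_id").size().reset_index(name="count")
print(counts)
```

Looking at rejected before discarding it is a cheap way to confirm that only the unwanted ids are being removed.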