Unexpected values appear in pandas groupby output

Time: 2017-05-18 12:56:43

Tags: pandas

I have some data. You can see it below.

     user_id  item_id cate_id action_type  action_time
0   11482147   492681    1_11        view   1487174400
1   12070750   457406    1_14   deep_view   1487174400
2   12431632   527476     1_1        view   1487174400
3   13397746   531771     1_6   deep_view   1487174400
4   13794253   510089    1_27   deep_view   1487174400
5   14378544   535335     1_6   deep_view   1487174400
6    1705634   535202    1_10        view   1487174400
7    6943823   478183     1_3   deep_view   1487174400
8    5902475   524378     1_6        view   1487174401

Then I wrote this code: print(w.groupby('user_id').size())

But the result is not what I wanted. You can see it below.

077F63F3-3DF4-4041-B3C9-7BAB2BDCA795     67
08f6ea6d2181b902d8cbeccdccf61efc         34
095A18FB-2C8E-4C00-8F2D-B481CB674ECE      4
096F9140-F748-4DE3-A4C3-EBAAA277144D     64
0B9DDF98-12A0-45DF-9CF7-F4194BF23282     64
0F3D4D6F-A906-4396-BA3B-1E69B0F6867C      8
10000484                                 88
10000886                                105
10000953                                 51
10000956                                 41
10001967                                165

Why does this happen?

1 Answer:

Answer 0: (score: 0)

If you need to convert the index to a column, call reset_index:

df = w.groupby('user_id').size().reset_index(name='count')
print (df)
    user_id  count
0   1705634      1
1   5902475      1
2   6943823      1
3  11482147      1
4  12070750      1
5  12431632      1
6  13397746      1
7  13794253      1
8  14378544      1
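For anyone who wants to reproduce this without the original file, here is a minimal sketch. It rebuilds a few columns of the sample rows from the question by hand; the variable name counts is only illustrative.

```python
import pandas as pd

# Sample rows copied from the question; `w` is the name used in the question.
w = pd.DataFrame({
    "user_id": [11482147, 12070750, 12431632, 13397746, 13794253,
                14378544, 1705634, 6943823, 5902475],
    "item_id": [492681, 457406, 527476, 531771, 510089,
                535335, 535202, 478183, 524378],
    "action_type": ["view", "deep_view", "view", "deep_view", "deep_view",
                    "deep_view", "view", "deep_view", "view"],
})

# size() counts the rows in each user_id group and returns a Series;
# reset_index(name="count") moves the group keys into a regular column.
counts = w.groupby("user_id").size().reset_index(name="count")
print(counts)
```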

Calling groupby + size returns a Series whose index holds the group keys, so the first column in the printed output is the index:

s = w.groupby('user_id').size()
print (s)
user_id
1705634     1
5902475     1
6943823     1
11482147    1
12070750    1
12431632    1
13397746    1
13794253    1
14378544    1
dtype: int64

print (type(s))
<class 'pandas.core.series.Series'>

print (s.index)
Int64Index([ 1705634,  5902475,  6943823, 11482147, 12070750, 12431632,
            13397746, 13794253, 14378544],
           dtype='int64', name='user_id')
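Because the group keys live in the index, a single count can be looked up by key. The following is a small sketch on made-up data (the four user_id values are assumptions chosen so that one ID repeats):

```python
import pandas as pd

# Hypothetical data: one user_id appears twice.
w = pd.DataFrame({"user_id": [11482147, 1705634, 11482147, 5902475]})

s = w.groupby("user_id").size()

# Look up the row count for one user by its key in the index.
count_for_user = s.loc[11482147]

# The group keys themselves are available as a plain list.
keys = s.index.tolist()

print(count_for_user)
print(keys)
```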

However, if you think the values are wrong, you can also sort the DataFrame by the user_id column with sort_values and check, because groupby sorts the group keys by default:

print (w.sort_values(['user_id']))
    user_id  item_id cate_id action_type  action_time
6   1705634   535202    1_10        view   1487174400
8   5902475   524378     1_6        view   1487174401
7   6943823   478183     1_3   deep_view   1487174400
0  11482147   492681    1_11        view   1487174400
1  12070750   457406    1_14   deep_view   1487174400
2  12431632   527476     1_1        view   1487174400
3  13397746   531771     1_6   deep_view   1487174400
4  13794253   510089    1_27   deep_view   1487174400
5  14378544   535335     1_6   deep_view   1487174400
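If the sorted order is what makes the output look unfamiliar, groupby also accepts sort=False, which keeps the groups in order of first appearance. A short sketch on made-up data (the user_id values are assumptions):

```python
import pandas as pd

# Hypothetical data in arbitrary order.
w = pd.DataFrame({"user_id": [11482147, 1705634, 11482147, 5902475]})

# Default behaviour: group keys are sorted.
sorted_counts = w.groupby("user_id").size()

# sort=False keeps the keys in the order they first appear in the data.
unsorted_counts = w.groupby("user_id", sort=False).size()

print(sorted_counts.index.tolist())
print(unsorted_counts.index.tolist())
```

The counts are the same either way; only the order of the keys differs.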

Edit:

To remove the rows with non-numeric values, use:

# the first and last values were changed to non-numeric for this example
print (w['user_id'])
0    0F3D4D6F-A906-4396-BA3B-1E69B0F6867C 
1                                 12070750
2                                 12431632
3                                 13397746
4                                 13794253
5                                 14378544
6                                  1705634
7                                  6943823
8     077F63F3-3DF4-4041-B3C9-7BAB2BDCA795
Name: user_id, dtype: object

w = w[pd.to_numeric(w['user_id'], errors='coerce').notnull()]
print (w)
    user_id  item_id cate_id action_type  action_time
1  12070750   457406    1_14   deep_view   1487174400
2  12431632   527476     1_1        view   1487174400
3  13397746   531771     1_6   deep_view   1487174400
4  13794253   510089    1_27   deep_view   1487174400
5  14378544   535335     1_6   deep_view   1487174400
6   1705634   535202    1_10        view   1487174400
7   6943823   478183     1_3   deep_view   1487174400

Explanation:

Filter by boolean indexing with a mask created by to_numeric (the parameter errors='coerce' replaces non-numeric values with NaN) and notnull:

print (pd.to_numeric(w['user_id'], errors='coerce'))
0           NaN
1    12070750.0
2    12431632.0
3    13397746.0
4    13794253.0
5    14378544.0
6     1705634.0
7     6943823.0
8           NaN
Name: user_id, dtype: float64
print (pd.to_numeric(w['user_id'], errors='coerce').notnull())
0    False
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8    False
Name: user_id, dtype: bool
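Putting the steps together, here is a hedged end-to-end sketch. It assumes user_id was read as strings and uses a shortened, hand-made sample; the names numeric_ids, mask, rejected and clean are illustrative only. Looking at the rejected rows before dropping them can help confirm that only the unwanted IDs are removed.

```python
import pandas as pd

# Hypothetical sample: user_id stored as strings, with two non-numeric IDs.
w = pd.DataFrame({
    "user_id": ["0F3D4D6F-A906-4396-BA3B-1E69B0F6867C", "12070750", "12431632",
                "1705634", "077F63F3-3DF4-4041-B3C9-7BAB2BDCA795"],
    "action_type": ["view", "deep_view", "view", "view", "deep_view"],
})

# Non-numeric strings become NaN, so notnull() marks the rows to keep.
numeric_ids = pd.to_numeric(w["user_id"], errors="coerce")
mask = numeric_ids.notnull()

# Inspect what is about to be dropped.
rejected = w.loc[~mask, "user_id"]

# Keep the numeric rows and store the IDs as integers instead of strings.
clean = w[mask].copy()
clean["user_id"] = numeric_ids[mask].astype("int64")

print(rejected.tolist())
print(clean.groupby("user_id").size())
```

Note that the coerced column is float64, so the cast back to int64 is only safe for IDs small enough to be represented exactly as floats, which holds for the 7 and 8 digit IDs shown here.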