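Pivoting a pandas DataFrame with duplicate index values

Asked: 2015-04-28 18:01:16

Tags: python pandas

I have a DataFrame with one row for each time a user joins my site or makes a purchase.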

+---+-----+--------------------+---------+--------+-----+
|   | uid |        msg         |  _time  | gender | age |
+---+-----+--------------------+---------+--------+-----+
| 0 |   1 | confirmed_settings | 1/29/15 | M      |  37 |
| 1 |   1 | sale               | 4/13/15 | M      |  37 |
| 2 |   3 | confirmed_settings | 4/19/15 | M      |  35 |
| 3 |   4 | confirmed_settings | 2/21/15 | M      |  21 |
| 4 |   5 | confirmed_settings | 3/28/15 | M      |  18 |
| 5 |   4 | sale               | 3/15/15 | M      |  21 |
+---+-----+--------------------+---------+--------+-----+

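I want to reshape the DataFrame so that each row is unique per uid, with columns named sale and confirmed_settings that hold the timestamp of each action. Note that not every user has a sale, but every user has a confirmed_settings. Like this: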

+---+-----+--------------------+---------+---------+--------+-----+
|   | uid | confirmed_settings |  sale   |  _time  | gender | age |
+---+-----+--------------------+---------+---------+--------+-----+
| 0 |   1 | 1/29/15            | 4/13/15 | 1/29/15 | M      |  37 |
| 1 |   3 | 4/19/15            | null    | 4/19/15 | M      |  35 |
| 2 |   4 | 2/21/15            | 3/15/15 | 2/21/15 | M      |  21 |
| 3 |   5 | 3/28/15            | null    | 3/28/15 | M      |  18 |
+---+-----+--------------------+---------+---------+--------+-----+

To do this, I am trying:

df1 = df.pivot(index='uid', columns='msg', values='_time').reset_index()
df1 = df1.merge(df[['uid', 'gender', 'age']].drop_duplicates(), on='uid')

But I get this error: ValueError: Index contains duplicate entries, cannot reshape

How can I pivot a df with duplicate index values to transform my DataFrame?

Edit: df1 = df.pivot_table(index='uid', columns='msg', values='_time').reset_index()

gives this error: DataError: No numeric types to aggregate, but I am not even sure this is the right path.

3 answers:

Answer 0: (score: 2)

x is your input DataFrame:

    uid               msg   _time   gender  age
0   1   confirmed_settings  1/29/15 M       37
1   1   sale                4/13/15 M       37
2   3   confirmed_settings  4/19/15 M       35
3   4   confirmed_settings  2/21/15 M       21
4   5   confirmed_settings  3/28/15 M       18
5   4   sale                3/15/15 M       21

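# Pivot to one row per uid, then join on the uid column
# (a plain x.join(y) would align y's uid index against x's row numbers).
y = x.pivot(index='uid', columns='msg', values='_time')
x.join(y, on='uid').drop('msg', axis=1)

gives you:

   uid    _time gender  age confirmed_settings     sale
0    1  1/29/15      M   37            1/29/15  4/13/15
1    1  4/13/15      M   37            1/29/15  4/13/15
2    3  4/19/15      M   35            4/19/15      NaN
3    4  2/21/15      M   21            2/21/15  3/15/15
4    5  3/28/15      M   18            3/28/15      NaN
5    4  3/15/15      M   21            2/21/15  3/15/15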
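
The question asks for one row per uid. A minimal sketch of one way to get there, assuming every (uid, msg) pair occurs at most once, as in the sample data. The names wide, users and result are illustrative and not from the original post:

```python
import pandas as pd

# Sample data copied from the question.
x = pd.DataFrame({
    "uid": [1, 1, 3, 4, 5, 4],
    "msg": ["confirmed_settings", "sale", "confirmed_settings",
            "confirmed_settings", "confirmed_settings", "sale"],
    "_time": ["1/29/15", "4/13/15", "4/19/15", "2/21/15", "3/28/15", "3/15/15"],
    "gender": ["M"] * 6,
    "age": [37, 37, 35, 21, 18, 21],
})

# One row per uid, one column per msg value.
# This only works when no (uid, msg) pair is repeated.
wide = x.pivot(index="uid", columns="msg", values="_time").reset_index()
wide.columns.name = None  # drop the leftover "msg" label on the column axis

# Per-user attributes, reduced to one row per uid.
users = x[["uid", "gender", "age"]].drop_duplicates()

result = wide.merge(users, on="uid")
print(result)
```

Users without a sale (uid 3 and 5 here) end up with NaN in the sale column.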

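Answer 1: (score: 2)

I suspect you really do have duplicate uid-msg entries/keys (for example, uid 2 having two confirmed_settings entries under msg), which you mentioned in your comment on fixxxer's answer. If so, you cannot use pivot, because there is no way to tell it how to handle the different values it encounters for the same cell (count? max? mean? sum?). Note that the index error refers to the index of the resulting pivoted table df1, not to the original DataFrame df.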
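
The failure can be reproduced on a tiny frame. The sketch below uses made-up rows (the dates for uid 2 are hypothetical, not from the question) in which one (uid, msg) pair appears twice:

```python
import pandas as pd

# Hypothetical data: uid 2 has two confirmed_settings rows.
dup = pd.DataFrame({
    "uid": [2, 2, 2],
    "msg": ["confirmed_settings", "confirmed_settings", "sale"],
    "_time": ["1/05/15", "1/09/15", "2/01/15"],
})

try:
    dup.pivot(index="uid", columns="msg", values="_time")
    error_message = None
except ValueError as exc:
    # pivot has two candidate values for the cell (2, confirmed_settings)
    # and no rule for choosing between them, so it raises.
    error_message = str(exc)

print(error_message)
```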

pivot_table lets you specify this with its aggfunc parameter. How about something like this?

df1 = df.pivot_table(index = 'uid', columns = 'msg', values = '_time', aggfunc = len)

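This will help you identify which user-message records have duplicate entries (anything greater than 1). Once those are cleaned up, you can pivot df successfully with pivot.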
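
As a complement to the count table, the offending rows can also be listed directly and then removed before pivoting. This is only a sketch on made-up data. Keeping the last-listed row of each pair is an assumed clean-up rule, and the names dupes, clean and wide are illustrative:

```python
import pandas as pd

# Hypothetical data: uid 2 has two confirmed_settings rows.
df = pd.DataFrame({
    "uid": [1, 1, 2, 2, 2],
    "msg": ["confirmed_settings", "sale",
            "confirmed_settings", "confirmed_settings", "sale"],
    "_time": ["1/29/15", "4/13/15", "1/05/15", "1/09/15", "2/01/15"],
})

# Every row whose (uid, msg) pair appears more than once.
dupes = df[df.duplicated(["uid", "msg"], keep=False)]
print(dupes)

# Assumed clean-up rule: keep the last-listed row of each (uid, msg) pair.
clean = df.drop_duplicates(["uid", "msg"], keep="last")

# With unique pairs, plain pivot no longer raises.
wide = clean.pivot(index="uid", columns="msg", values="_time")
print(wide)
```

Which row to keep depends on what the duplicates mean in your data, so the keep="last" choice should be replaced by whatever rule fits.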

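Answer 2: (score: 1)

You can use groupby to aggregate by the common factors, take the max of the time to get the most recent date, and then unstack the result to see confirmed settings and sales side by side:

df.groupby(['uid', 'msg', 'gender', 'age'])['_time'].max().unstack('msg')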

msg            confirmed_settings     sale
uid gender age                            
1   M      37             1/29/15  4/13/15
3   M      35             4/19/15      NaN
4   M      21             2/21/15  3/15/15
5   M      18             3/28/15      NaN
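
One caveat: while _time holds strings, max compares them lexicographically rather than by date (for example, "9/30/15" sorts after "10/2/15"). A hedged variant of the same groupby approach, assuming the dates are in month/day/two-digit-year form as in the sample, converts the column first and flattens the result into ordinary columns:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "uid": [1, 1, 3, 4, 5, 4],
    "msg": ["confirmed_settings", "sale", "confirmed_settings",
            "confirmed_settings", "confirmed_settings", "sale"],
    "_time": ["1/29/15", "4/13/15", "4/19/15", "2/21/15", "3/28/15", "3/15/15"],
    "gender": ["M"] * 6,
    "age": [37, 37, 35, 21, 18, 21],
})

# Assumption: dates are month/day/two-digit year.
df["_time"] = pd.to_datetime(df["_time"], format="%m/%d/%y")

result = (
    df.groupby(["uid", "gender", "age", "msg"])["_time"]
      .max()            # latest timestamp per user and message type
      .unstack("msg")   # one column per message type
      .reset_index()    # turn uid, gender, age back into columns
)
result.columns.name = None
print(result)
```

Because max resolves repeated (uid, msg) pairs, this version also works when the duplicate entries that break pivot are present.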