How do I create a pivot table from a DataFrame that has a column containing lists?

Asked: 2019-12-29 17:54:23

Tags: python pandas

I have a DataFrame that looks like this:

import pandas as pd

data = [
  {
    "userId": 1,
    "binary_vote": 0,
    "genres": [
      "Adventure",
      "Comedy"
    ]
  },
  {
    "userId": 1,
    "binary_vote": 1,
    "genres": [
      "Adventure",
      "Drama"
    ]
  },
  {
    "userId": 2,
    "binary_vote": 0,
    "genres": [
      "Comedy",
      "Drama"
    ]
  },
  {
    "userId": 2,
    "binary_vote": 1,
    "genres": [
      "Adventure",
      "Drama"
    ]
  },
]

df = pd.DataFrame(data)
print(df)

   userId  binary_vote               genres
0  1       0            [Adventure, Comedy]
1  1       1            [Adventure, Drama]
2  2       0            [Comedy, Drama]
3  2       1            [Adventure, Drama]

I want to create one column for each value of binary_vote. This is the expected output:

   userId        binary_vote_0       binary_vote_1
0  1       [Adventure, Comedy]  [Adventure, Drama]
1  2       [Comedy, Drama]      [Adventure, Drama]

I tried something like this, but I get an error:

pd.pivot_table(df, columns=['binary_vote'], values='genres')

This is the error:

DataError: No numeric types to aggregate

Any ideas? Thanks in advance.

2 Answers:

Answer 0: (score: 3)

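We have to supply our own aggfunc, which is simple in this case.

Your call fails because pivot_table tries to take the mean, which is the default aggregation function. That obviously fails on your lists. Since each (userId, binary_vote) pair occurs only once, an aggfunc that returns its input unchanged is enough: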

piv = (
    df.pivot_table(index='userId', columns='binary_vote', values='genres', aggfunc=lambda x: x)
      .add_prefix('binary_vote_')
      .reset_index()
      .rename_axis(None, axis=1)
)
print(piv)
   userId        binary_vote_0       binary_vote_1
0       1  [Adventure, Comedy]  [Adventure, Drama]
1       2      [Comedy, Drama]  [Adventure, Drama]
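
The identity aggfunc above relies on every (userId, binary_vote) pair appearing exactly once. If a pair can repeat, one possible variation is an aggfunc that concatenates the lists in each group. This is only a minimal sketch: merge_genres is a hypothetical helper name, and the extra ["Horror"] row is made up purely to show a repeated pair; neither comes from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "userId": [1, 1, 2, 2, 1],
    "binary_vote": [0, 1, 0, 1, 1],
    "genres": [
        ["Adventure", "Comedy"],
        ["Adventure", "Drama"],
        ["Comedy", "Drama"],
        ["Adventure", "Drama"],
        ["Horror"],  # hypothetical extra row: repeats the pair (userId=1, binary_vote=1)
    ],
})


def merge_genres(group):
    """Concatenate every list found in one (userId, binary_vote) group."""
    return [genre for genres in group for genre in genres]


piv = (
    df.pivot_table(index="userId", columns="binary_vote",
                   values="genres", aggfunc=merge_genres)
      .add_prefix("binary_vote_")   # 0 / 1 -> binary_vote_0 / binary_vote_1
      .reset_index()                # turn userId back into a regular column
      .rename_axis(None, axis=1)    # drop the leftover "binary_vote" columns name
)
print(piv)
```

With this data, the binary_vote_1 cell for user 1 should hold the merged list of both rows, while the cells backed by a single row keep their original list.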

Answer 1: (score: 1)

Another approach, using set_index() and unstack():

# Move both keys into the index, then spread binary_vote across the columns
m = (df.set_index(['userId', 'binary_vote']).unstack()
       .add_prefix('binary_vote_').droplevel(level=0, axis=1))
print(m.reset_index().rename_axis(None, axis=1))

   userId        binary_vote_0       binary_vote_1
0       1  [Adventure, Comedy]  [Adventure, Drama]
1       2      [Comedy, Drama]  [Adventure, Drama]
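
Since nothing is actually aggregated here, the same reshaping can probably also be written with DataFrame.pivot, which reshapes without an aggregation function and so has no problem with list values. This is a sketch under the assumption that every (userId, binary_vote) pair is unique, as in the question's data; with repeated pairs, pivot (like unstack) raises an error instead.

```python
import pandas as pd

# Same data as in the question, written column-wise
df = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "binary_vote": [0, 1, 0, 1],
    "genres": [
        ["Adventure", "Comedy"],
        ["Adventure", "Drama"],
        ["Comedy", "Drama"],
        ["Adventure", "Drama"],
    ],
})

piv = (
    df.pivot(index="userId", columns="binary_vote", values="genres")
      .add_prefix("binary_vote_")   # 0 / 1 -> binary_vote_0 / binary_vote_1
      .reset_index()                # turn userId back into a regular column
      .rename_axis(None, axis=1)    # drop the leftover "binary_vote" columns name
)
print(piv)
```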