Question

I have a large pandas.DataFrame that looks like this:

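import numpy
import pandas
test = pandas.DataFrame({"score": numpy.random.randn(10)})
test["name"] = ["A"] * 3 + ["B"] * 3 + ["C"] * 4
# range objects cannot be concatenated with + in Python 3, so build lists first
test.index = list(range(3)) + list(range(3)) + list(range(4))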

id  score       name
0   -0.652909   A
1   0.100885    A
2   0.410907    A
0   0.304012    B
1   -0.198157   B
2   -0.054764   B
0   0.358484    C
1   0.616415    C
2   0.389018    C
3   1.164172    C

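So the index is non-unique overall, but it is unique within each group when I group by the column name. I want to split the DataFrame into sub-frames by name, then combine the score columns (via an outer join) into one big new DataFrame, renaming each score column to the corresponding group key. What I have right now is: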

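df = pandas.DataFrame()
for key, sub in test.groupby("name"):
    # Outer-join this group's scores on the index ...
    df = df.join(sub["score"], how="outer")
    # ... then rename the freshly added "score" column to the group key
    df = df.rename(columns={"score": key})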

This produces the expected result:

id  A           B           C
0   -0.652909   0.304012    0.358484
1   0.100885    -0.198157   0.616415
2   0.410907    -0.054764   0.389018
3   NaN         NaN         1.164172

But it doesn't seem very pandas-ic. Is there a better way?

Edit: Based on the answers, I ran some simple timings.

%%timeit
df = pandas.DataFrame()
for (key, sub) in test.groupby("name"):
    df = df.join(sub["score"], how="outer")
    df.columns.values[-1] = key

100 loops, best of 3: 2.46 ms per loop

%%timeit
test.set_index([test.index, "name"]).unstack()

1000 loops, best of 3: 1.04 ms per loop

%%timeit
test.pivot_table("score", test.index, "name")

100 loops, best of 3: 2.54 ms per loop

So unstack seems to be the preferred method.
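
For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard timeit module. The seed, the function names, and the 100 repetitions are my own choices for illustration, and the loop variant names each column with Series.rename, so the numbers will not match the ones above exactly.

```python
import timeit

import numpy as np
import pandas as pd

# Rebuild the example frame; the seed is arbitrary and only there for repeatability.
rng = np.random.default_rng(0)
test = pd.DataFrame({"score": rng.standard_normal(10)})
test["name"] = ["A"] * 3 + ["B"] * 3 + ["C"] * 4
test.index = list(range(3)) + list(range(3)) + list(range(4))


def with_loop():
    # Outer-join each group's scores, naming the column after the group key.
    df = pd.DataFrame()
    for key, sub in test.groupby("name"):
        df = df.join(sub["score"].rename(key), how="outer")
    return df


def with_unstack():
    # Append "name" as the last index level and pivot it into columns.
    return test.set_index("name", append=True)["score"].unstack()


def with_pivot_table():
    # Same reshape through pivot_table (default aggregation is the mean).
    return test.pivot_table(values="score", index=test.index, columns="name")


for func in (with_loop, with_unstack, with_pivot_table):
    seconds = timeit.timeit(func, number=100) / 100
    print(f"{func.__name__}: {seconds * 1e3:.2f} ms per call")
```

Timings on a 10-row frame mostly measure fixed overhead, so it is worth rerunning this on data closer to the real size before drawing conclusions.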

Answer 1

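The function you are looking for is unstack. So that pandas knows what to unstack on, we first create a MultiIndex in which the name column is added as the last index level. unstack() then unstacks on the last index level (by default), so we get exactly what you want: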

In[152]: test = pandas.DataFrame({"score": numpy.random.randn(10)})
test["name"] = ["A"] * 3 + ["B"] * 3 + ["C"] * 4
test.index = list(range(3)) + list(range(3)) + list(range(4))
In[153]: test
Out[153]: 
      score name
0 -0.208392    A
1 -0.103659    A
2  1.645287    A
0  0.119709    B
1 -0.047639    B
2 -0.479155    B
0 -0.415372    C
1 -1.390416    C
2 -0.384158    C
3 -1.328278    C
In[154]: test.set_index([test.index, 'name'], inplace=True)
test.unstack()
Out[154]: 
         score                    
name         A         B         C
0    -0.208392  0.119709 -0.415372
1    -0.103659 -0.047639 -1.390416
2     1.645287 -0.479155 -0.384158
3          NaN       NaN -1.328278
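
The output above still carries an extra "score" level in the columns. As a minimal sketch (assuming Python 3 and a reasonably recent pandas; the seed and the variable name wide are mine), selecting the score Series before unstacking gives the flat A/B/C columns from the question:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame; the seed is arbitrary and only there for repeatability.
rng = np.random.default_rng(0)
test = pd.DataFrame({"score": rng.standard_normal(10)})
test["name"] = ["A"] * 3 + ["B"] * 3 + ["C"] * 4
test.index = list(range(3)) + list(range(3)) + list(range(4))

# Append "name" as the last index level, keep only the score Series,
# then pivot that last level into columns.
wide = test.set_index("name", append=True)["score"].unstack()

print(wide)
```

Because a Series is unstacked here rather than a DataFrame, the result has a single column level named "name".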

Answer 2

I recently ran into a similar problem and solved it using pivot_table.

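import io
import pandas as pd

a = """id  score       name
0   -0.652909   A
1   0.100885    A
2   0.410907    A
0   0.304012    B
1   -0.198157   B
2   -0.054764   B
0   0.358484    C
1   0.616415    C
2   0.389018    C
3   1.164172    C"""

# Columns are separated by runs of whitespace
df = pd.read_csv(io.StringIO(a), sep=r"\s+")
# One row per id, one column per name; the cells hold the scores
df = df.pivot_table(values="score", index="id", columns="name")

print(df)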

Output:

name         A         B         C
id                                
0    -0.652909  0.304012  0.358484
1     0.100885 -0.198157  0.616415
2     0.410907 -0.054764  0.389018
3          NaN       NaN  1.164172
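
One difference between the two approaches is worth knowing: pivot_table aggregates duplicate (id, name) pairs, with the mean as the default, while unstack raises an error on them. The sketch below uses made-up data (the frame dup and its values are purely illustrative) to show both behaviours:

```python
import pandas as pd

# Hypothetical data: the pair (id=0, name="A") occurs twice.
dup = pd.DataFrame(
    {"id": [0, 0, 1], "name": ["A", "A", "A"], "score": [1.0, 3.0, 5.0]}
)

# pivot_table aggregates duplicates; the default aggfunc is the mean,
# so the cell for (0, "A") becomes the mean of 1.0 and 3.0.
averaged = dup.pivot_table(values="score", index="id", columns="name")
print(averaged)

# unstack refuses to reshape when the index contains duplicate entries.
try:
    dup.set_index(["id", "name"])["score"].unstack()
    unstack_failed = False
except ValueError as exc:
    unstack_failed = True
    print("unstack raised:", exc)
```

If the (index, name) pairs are guaranteed to be unique, as in the question, both give the same table; otherwise the error from unstack may be the safer signal.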

Use groupby to split a DataFrame and merge the subsets into columns