Converting a pandas DataFrame to a Series

Date: 2016-01-14 12:57:37

Tags: pandas dataframe series

I have the following DataFrame.

   order_id   Clusters
0    519     Cluster 5
1    520     Cluster 1
2    521     Cluster 1
3    523     Cluster 5
4    524     Cluster 1
5    525     Cluster 4
6    526     Cluster 4
7    527     Cluster 1
8    528     Cluster 2
9    529     Cluster 5
10   530     Cluster 6
11   531     Cluster 3
12   532     Cluster 1
13   533     Cluster 4
14   534     Cluster 5
15   535     Cluster 5

I would like to derive the following Series from the DataFrame above.

Cluster 1   [520, 521, 524, 527, 532]
Cluster 2   [528]
Cluster 3   [531]
Cluster 4   [525, 526, 533]
Cluster 5   [519, 523, 529, 534, 535]
Cluster 6   [530]

This is my approach in Python.

clusters_order_id = []

df_clusters = df.groupby('Clusters')

for i in df_clusters['order_id']:
   clusters_order_id.append(i)

This gives me

clusters_order_id
Out[196]: 
0    (Cluster 1, [520, 521, 524, 527, 532])
1                        (Cluster 2, [528])
2                        (Cluster 3, [531])
3              (Cluster 4, [525, 526, 533])
4    (Cluster 5, [519, 523, 529, 534, 535])
5                        (Cluster 6, [530])

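But I can't figure out how to turn this into the Series form shown above, where Cluster 1, Cluster 2, and so on are the index and the corresponding order ids form an array. Please help.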

2 answers:

Answer 0 (score: 2)

Another solution, using pivot_table:

In [473]: df.pivot_table(index='Clusters', aggfunc=pd.Series.tolist)
Out[473]:
                            order_id
Clusters
Cluster 1  [520, 521, 524, 527, 532]
Cluster 2                      [528]
Cluster 3                      [531]
Cluster 4            [525, 526, 533]
Cluster 5  [519, 523, 529, 534, 535]
Cluster 6                      [530]

Answer 1 (score: 1)

You can use groupby with tolist:

print(df.groupby('Clusters')['order_id'].apply(lambda x: x.tolist()))

Clusters
Cluster 1    [520, 521, 524, 527, 532]
Cluster 2                        [528]
Cluster 3                        [531]
Cluster 4              [525, 526, 533]
Cluster 5    [519, 523, 529, 534, 535]
Cluster 6                        [530]
Name: order_id, dtype: object
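
For anyone who wants to try this end to end, here is a minimal self-contained sketch. It assumes the DataFrame from the question is rebuilt by hand from the table shown there. The names result and from_loop are only illustrative, and apply(list) is used as a shorter stand-in for the lambda above. It also shows why the loop in the question produced tuples: iterating over a grouped column yields (name, group) pairs.

```python
import pandas as pd

# Sample data reconstructed by hand from the table in the question.
df = pd.DataFrame({
    "order_id": [519, 520, 521, 523, 524, 525, 526, 527,
                 528, 529, 530, 531, 532, 533, 534, 535],
    "Clusters": ["Cluster 5", "Cluster 1", "Cluster 1", "Cluster 5",
                 "Cluster 1", "Cluster 4", "Cluster 4", "Cluster 1",
                 "Cluster 2", "Cluster 5", "Cluster 6", "Cluster 3",
                 "Cluster 1", "Cluster 4", "Cluster 5", "Cluster 5"],
})

# Collect the order ids of each cluster into a Python list.
# The result is a Series indexed by cluster name.
result = df.groupby("Clusters")["order_id"].apply(list)

# Iterating over the grouped column yields (name, group) pairs,
# so the same Series can also be built from a dict of those pairs.
from_loop = pd.Series({name: group.tolist()
                       for name, group in df.groupby("Clusters")["order_id"]})

print(result)
```

Both result and from_loop hold one list per cluster. The groupby version additionally keeps order_id as the Series name and Clusters as the index name.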

Timings:

In [153]: %timeit df.groupby('Clusters')['order_id'].apply(lambda x: x.tolist())
1000 loops, best of 3: 751 µs per loop

In [154]: %timeit df.pivot_table(index='Clusters', aggfunc=pd.Series.tolist)
100 loops, best of 3: 3.55 ms per loop