Python / Pandas - ValueError: Index contains duplicate entries, cannot reshape

Asked: 2017-08-22 19:19:04

Tags: python pandas

I have a DataFrame named 'bal'. It looks like this:

              ano   id   unit period
business_id                         
9564         2012  302  sdasd  anual
9564         2011  303  sdasd  anual
2361         2013  304  sdasd  anual
2361         2012  305  sdasd  anual
...

I am running the following code:

bal=bal.merge(bal.pivot(columns='ano', values='id'),right_index=True,left_index=True)

My intention is to turn it into something like this:

              ano    id   unit period  2006  2007  2008  2009  2010  \
business_id
72           2013   774  sdasd  anual   NaN   NaN   NaN   NaN   NaN
72           2012   775  sdasd  anual   NaN   NaN   NaN   NaN   NaN
74           2012  1120  sdasd  anual   NaN   NaN   NaN   NaN   NaN
119          2013   875  sdasd  anual   NaN   NaN   NaN   NaN   NaN
119          2012   876  sdasd  anual   NaN   NaN   NaN   NaN   NaN
...

When I run the code, I get this error:

ValueError: Index contains duplicate entries, cannot reshape

To avoid the duplicates, I added a drop_duplicates line:

bal=bal.drop_duplicates()
bal=bal.merge(bal.pivot(columns='ano', values='id'),right_index=True,left_index=True)

When I run the code, I get the same error:

ValueError: Index contains duplicate entries, cannot reshape

What am I doing wrong or misunderstanding?

Edit

bal is a DataFrame that I create from SQL with the following code:

bal=pd.read_sql('select * from table;',connection).set_index('business_id')[['ano','id','unit','period']]

Strangely, if I limit the SQL query, it works fine:

bal=pd.read_sql('select * from table limit 1000;',connection).set_index('business_id')[['ano','id','unit','period']]

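I thought the problem might be related to the fact that the index has a lot of duplicates (as you can see in the example above). However, if I run print(bal.head(4)) on this limited bal, it looks exactly like what you see above, with the index values repeated.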

2 answers:

Answer 0: (score: 3)

UPDATE2:

qry = "select distinct business_id,ano,id,unit,period from table where period='anual'"
bal=pd.read_sql(qry, connection, index_col=['business_id'])

Suppose we get the following DF (there are still duplicate values in the ano column for the same business_id):

In [167]: bal
Out[167]:
              ano   id   unit period
business_id
9564         2012  302  sdasd  anual
9564         2012  299  sdasd  anual
9564         2011  303  sdasd  anual
2361         2013  304  sdasd  anual
2361         2012  305  sdasd  anual

We can do this:

In [169]: bal.join(bal.pivot_table(index=bal.index, columns='ano',
                                   values='id', aggfunc='first'))
Out[169]:
              ano   id   unit period   2011   2012   2013
business_id
2361         2013  304  sdasd  anual    NaN  305.0  304.0
2361         2012  305  sdasd  anual    NaN  305.0  304.0
9564         2012  302  sdasd  anual  303.0  302.0    NaN
9564         2012  299  sdasd  anual  303.0  302.0    NaN
9564         2011  303  sdasd  anual  303.0  302.0    NaN
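
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. The sample values are copied from the frame above. Resetting the index and passing index="business_id" by name is my own variation, and keeping the 'first' id for a repeated (business_id, ano) pair is an assumption about what you want rather than a rule.

```python
import pandas as pd

# Sample frame copied from the example above
bal = pd.DataFrame(
    {
        "business_id": [9564, 9564, 9564, 2361, 2361],
        "ano": [2012, 2012, 2011, 2013, 2012],
        "id": [302, 299, 303, 304, 305],
        "unit": ["sdasd"] * 5,
        "period": ["anual"] * 5,
    }
).set_index("business_id")

# pivot_table aggregates repeated (business_id, ano) pairs instead of raising;
# aggfunc="first" keeps the first id seen for each pair (an assumed choice)
wide = bal.reset_index().pivot_table(
    index="business_id", columns="ano", values="id", aggfunc="first"
)

# Attach the per-year columns back onto every original row
result = bal.join(wide)
print(result)
```

If another rule makes more sense for your data (for example the largest id), a different aggfunc such as "max" can be swapped in.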

UPDATE:

Consider the following sample DF:

In [161]: bal
Out[161]:
              ano   id   unit period
business_id
9564         2012  302  sdasd  anual
9564         2012  299  sdasd  anual   # I've intentionally added this row with a duplicated `ano`
9564         2011  303  sdasd  anual
2361         2013  304  sdasd  anual
2361         2012  305  sdasd  anual

This reproduces your error:

In [162]: bal.pivot(columns='ano', values='id')
...
skipped
...
ValueError: Index contains duplicate entries, cannot reshape
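
To find the rows that trigger this, it may help to look for repeated (business_id, ano) pairs, since pivot() needs each (index, column) pair to be unique. The sketch below rebuilds the sample frame from this answer (illustrative values only) and also shows why drop_duplicates() does not help: it compares whole rows, and these rows differ in id.

```python
import pandas as pd

# Sample frame mirroring the one above (illustrative values)
bal = pd.DataFrame(
    {
        "business_id": [9564, 9564, 9564, 2361, 2361],
        "ano": [2012, 2012, 2011, 2013, 2012],
        "id": [302, 299, 303, 304, 305],
        "unit": ["sdasd"] * 5,
        "period": ["anual"] * 5,
    }
).set_index("business_id")

# List every row whose (business_id, ano) pair occurs more than once
pairs = bal.reset_index()
dupes = pairs[pairs.duplicated(subset=["business_id", "ano"], keep=False)]
print(dupes)

# drop_duplicates() only removes rows whose column values all match,
# so nothing is dropped here and pivot() would still fail
print(len(bal), len(bal.drop_duplicates()))
```

This would also explain why the query with limit 1000 works: the first 1000 rows probably just don't contain a repeated pair yet.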

Old answer:

Is this what you want?

In [144]: bal.join(bal.pivot(columns='ano', values='id'))
Out[144]:
              ano   id   unit period   2011   2012   2013
business_id
2361         2013  304  sdasd  anual    NaN  305.0  304.0
2361         2012  305  sdasd  anual    NaN  305.0  304.0
9564         2012  302  sdasd  anual  303.0  302.0    NaN
9564         2011  303  sdasd  anual  303.0  302.0    NaN

Answer 1: (score: 2)

Consider using unstack() with merge(); this takes care of the repeated business_id values in the index.

import pandas as pd

# sample data
data = {"business_id":[9564, 9564, 2361, 2361],
        "ano":[2012, 2011, 2013, 2012],
        "id":[302,303,304,305],
        "unit":["sdasd"]*4,
        "period":["anual"]*4}
df = pd.DataFrame(data)
# include ano for MultiIndex
df.set_index(["business_id","ano"], inplace=True)

df
                   id period   unit
business_id ano                    
9564        2012  302  anual  sdasd
            2011  303  anual  sdasd
2361        2013  304  anual  sdasd
            2012  305  anual  sdasd

Now unstack(), grab the id data, and merge(). The innermost index level is the one that gets unstacked, which is why we added ano to the index above.

df.merge(df.unstack()['id'], right_index=True, left_index=True)
                   id period   unit   2011   2012   2013
business_id ano                                         
9564        2012  302  anual  sdasd  303.0  302.0    NaN
            2011  303  anual  sdasd  303.0  302.0    NaN
2361        2013  304  anual  sdasd    NaN  305.0  304.0
            2012  305  anual  sdasd    NaN  305.0  304.0
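
One caveat, shown as a hedged sketch: unstack() raises the same ValueError when a (business_id, ano) pair appears more than once, so such rows have to be collapsed first. The sample below adds a hypothetical duplicate row (id 299) to the data above, and keeping the first row per pair is an arbitrary choice made only for illustration.

```python
import pandas as pd

# Sample data plus one hypothetical row that repeats (9564, 2012)
df = pd.DataFrame(
    {
        "business_id": [9564, 9564, 9564, 2361, 2361],
        "ano": [2012, 2012, 2011, 2013, 2012],
        "id": [302, 299, 303, 304, 305],
        "unit": ["sdasd"] * 5,
        "period": ["anual"] * 5,
    }
)

# With the duplicate pair still present, unstack() fails like pivot() does
unstack_failed = False
try:
    df.set_index(["business_id", "ano"])["id"].unstack()
except ValueError as err:
    unstack_failed = True
    print("unstack failed:", err)

# Keep one row per (business_id, ano); keep="first" is an assumed rule
deduped = df.drop_duplicates(subset=["business_id", "ano"], keep="first")
indexed = deduped.set_index(["business_id", "ano"])

# Unstack the innermost level (ano) of the id column, then merge on the index
wide = indexed["id"].unstack()
result = indexed.merge(wide, left_index=True, right_index=True)
print(result)
```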