I have a DataFrame named 'bal'. It looks like this:
ano id unit period
business_id
9564 2012 302 sdasd anual
9564 2011 303 sdasd anual
2361 2013 304 sdasd anual
2361 2012 305 sdasd anual
...
I am running the following code:
bal=bal.merge(bal.pivot(columns='ano', values='id'),right_index=True,left_index=True)
My intention is to turn it into something like this:
ano id unit period 2006 2007 2008 2009 2010 \
business_id
72 2013 774 sdasd anual NaN NaN NaN NaN NaN
72 2012 775 sdasd anual NaN NaN NaN NaN NaN
74 2012 1120 sdasd anual NaN NaN NaN NaN NaN
119 2013 875 sdasd anual NaN NaN NaN NaN NaN
119 2012 876 sdasd anual NaN NaN NaN NaN NaN
...
When I run the code, I get this error:
ValueError: Index contains duplicate entries, cannot reshape
To avoid duplicates, I added a drop_duplicates line:
bal=bal.drop_duplicates()
bal=bal.merge(bal.pivot(columns='ano', values='id'),right_index=True,left_index=True)
When I run the code, I get the same error:
ValueError: Index contains duplicate entries, cannot reshape
What am I doing wrong or misunderstanding?
EDIT
bal is a DataFrame that I created from SQL using the following code:
bal=pd.read_sql('select * from table;',connection).set_index('business_id')[['ano','id','unit','period']]
Strangely, if I limit the SQL query, it works fine:
bal=pd.read_sql('select * from table limit 1000;',connection).set_index('business_id')[['ano','id','unit','period']]
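I thought the problem might be related to the index containing many duplicates (as you can see in the example above). However, if I run print(bal.head(4)) on this limited bal, it looks exactly like what you see above, with the index values repeated.
Answer 0 (score: 3)
UPDATE 2: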
qry = "select distinct business_id,ano,id,unit,period from table where period='anual'"
bal=pd.read_sql(qry, connection, index_col=['business_id'])
Suppose we get the following DF (the ano column still contains duplicate values for the same business_id):
In [167]: bal
Out[167]:
ano id unit period
business_id
9564 2012 302 sdasd anual
9564 2012 299 sdasd anual
9564 2011 303 sdasd anual
2361 2013 304 sdasd anual
2361 2012 305 sdasd anual
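We can then do the following, where aggfunc='first' keeps the first id found for each (business_id, ano) pair: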
In [169]: bal.join(bal.pivot_table(index=bal.index, columns='ano',
values='id', aggfunc='first'))
Out[169]:
ano id unit period 2011 2012 2013
business_id
2361 2013 304 sdasd anual NaN 305.0 304.0
2361 2012 305 sdasd anual NaN 305.0 304.0
9564 2012 302 sdasd anual 303.0 302.0 NaN
9564 2012 299 sdasd anual 303.0 302.0 NaN
9564 2011 303 sdasd anual 303.0 302.0 NaN
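The pivot_table approach above can be reproduced without a database. The following is a minimal sketch, assuming the sample rows shown in this answer; the variable names flat, wide and result are only illustrative. It resets the index first so that business_id can be passed to pivot_table by name, which is a small variation on the call above rather than the exact same code.

```python
import pandas as pd

# Sample data copied from the example above (one duplicated business_id/ano pair).
bal = pd.DataFrame(
    {
        "business_id": [9564, 9564, 9564, 2361, 2361],
        "ano": [2012, 2012, 2011, 2013, 2012],
        "id": [302, 299, 303, 304, 305],
        "unit": ["sdasd"] * 5,
        "period": ["anual"] * 5,
    }
).set_index("business_id")

# Work on a flat copy so business_id is an ordinary column.
flat = bal.reset_index()

# aggfunc="first" keeps the first id seen for each (business_id, ano) pair,
# so the duplicated pair no longer blocks the reshape.
wide = flat.pivot_table(index="business_id", columns="ano", values="id", aggfunc="first")

# Attach the per-year columns to every original row via the shared index.
result = bal.join(wide)
print(result)
```

Note that 'first' silently discards the other id (299 here). If that matters, a different aggfunc, or cleaning the duplicates at the source, may be more appropriate.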
UPDATE:
Consider the following sample DF:
In [161]: bal
Out[161]:
ano id unit period
business_id
9564 2012 302 sdasd anual
9564 2012 299 sdasd anual # I intentionally added this row with a duplicated `ano`
9564 2011 303 sdasd anual
2361 2013 304 sdasd anual
2361 2012 305 sdasd anual
Reproducing your error:
In [162]: bal.pivot(columns='ano', values='id')
...
skipped
...
ValueError: Index contains duplicate entries, cannot reshape
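A quick way to see why drop_duplicates() did not help is to look for repeated (business_id, ano) pairs directly. drop_duplicates() compares whole rows, and the two 2012 rows for 9564 differ in id, so neither is dropped. The sketch below assumes the same sample data as above; flat, deduped and offending are illustrative names.

```python
import pandas as pd

# Same sample data as above, including the intentionally duplicated row.
bal = pd.DataFrame(
    {
        "business_id": [9564, 9564, 9564, 2361, 2361],
        "ano": [2012, 2012, 2011, 2013, 2012],
        "id": [302, 299, 303, 304, 305],
        "unit": ["sdasd"] * 5,
        "period": ["anual"] * 5,
    }
).set_index("business_id")

# drop_duplicates() only removes rows whose column values are all identical;
# the ids differ here, so all five rows survive.
deduped = bal.drop_duplicates()

# pivot() needs each (index, columns) pair to be unique, so check that instead.
flat = bal.reset_index()
dup_mask = flat.duplicated(subset=["business_id", "ano"], keep=False)
offending = flat[dup_mask]
print(offending)
```

Any rows printed here are the ones that make pivot() raise the ValueError.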
Old answer:
Is this what you want?
In [144]: bal.join(bal.pivot(columns='ano', values='id'))
Out[144]:
ano id unit period 2011 2012 2013
business_id
2361 2013 304 sdasd anual NaN 305.0 304.0
2361 2012 305 sdasd anual NaN 305.0 304.0
9564 2012 302 sdasd anual 303.0 302.0 NaN
9564 2011 303 sdasd anual 303.0 302.0 NaN
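Answer 1 (score: 2)
Consider using unstack() and merge(). This avoids the duplicate-index problem, provided each (business_id, ano) pair is unique.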
# sample data
data = {"business_id":[9564, 9564, 2361, 2361],
"ano":[2012, 2011, 2013, 2012],
"id":[302,303,304,305],
"unit":["sdasd"]*4,
"period":["anual"]*4}
df = pd.DataFrame(data)
# include ano for MultiIndex
df.set_index(["business_id","ano"], inplace=True)
df
id period unit
business_id ano
9564 2012 302 anual sdasd
2011 303 anual sdasd
2361 2013 304 anual sdasd
2012 305 anual sdasd
Now unstack(), grab the id data, and merge(). unstack() pivots the innermost index level, which is why we added ano to the index above.
df.merge(df.unstack()['id'], right_index=True, left_index=True)
id period unit 2011 2012 2013
business_id ano
9564 2012 302 anual sdasd 303.0 302.0 NaN
2011 303 anual sdasd 303.0 302.0 NaN
2361 2013 304 anual sdasd NaN 305.0 304.0
2012 305 anual sdasd NaN 305.0 304.0
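For reference, here is a self-contained sketch of the same idea, assuming the sample data from this answer. It adds an explicit uniqueness check, because unstack() raises the same "Index contains duplicate entries" error when a (business_id, ano) pair repeats. It also uses join(on="business_id") in place of the merge() call above, which is a substitution made for this sketch; flat, wide and result are illustrative names.

```python
import pandas as pd

# Sample data from this answer.
data = {
    "business_id": [9564, 9564, 2361, 2361],
    "ano": [2012, 2011, 2013, 2012],
    "id": [302, 303, 304, 305],
    "unit": ["sdasd"] * 4,
    "period": ["anual"] * 4,
}
flat = pd.DataFrame(data)

# unstack() needs unique (business_id, ano) pairs, so verify that up front.
has_duplicates = flat.duplicated(subset=["business_id", "ano"]).any()
assert not has_duplicates, "deduplicate or aggregate before unstacking"

# Put ano in the index so it becomes the innermost level.
df = flat.set_index(["business_id", "ano"])

# Unstacking the id Series moves ano into the columns: one column per year.
wide = df["id"].unstack()

# Match the business_id index level of df against the index of wide.
result = df.join(wide, on="business_id")
print(result)
```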