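I often need to use pandas' pivot functionality to convert stacked, long-format data frames into unstacked, wide-format ones.
As far as I can tell, "How to pivot a dataframe" does not address my problem.
If there are duplicate entries, the pivot fails. I usually track down and fix the duplicates by inspecting the data in an Excel pivot table and using count() to aggregate the values. This works most of the time (though not always), but I would like to know whether there is a way to stay in jupyterlab and find the problems in the data without resorting to Excel.
I have a data frame that looks like this: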
     ISO3         Country        Indicator  Year  Value
45    FRA          France  Domestic credit  2011  54.68
140   GBR  United Kingdom  Domestic credit  2011  89.39
141   USA   United States  Domestic credit  2011  93.10
217   FRA          France  Domestic credit  2012  37.41
368   GBR  United Kingdom  Domestic credit  2012  58.50
369   USA   United States  Domestic credit  2012  63.10
448   FRA          France  Domestic credit  2012  36.03
599   GBR  United Kingdom  Domestic credit  2013  50.95
600   USA   United States  Domestic credit  2013  63.40
679   FRA          France  Domestic credit  2014  36.63
830   GBR  United Kingdom  Domestic credit  2014  54.47
831   USA   United States  Domestic credit  2014  78.00
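To make the example reproducible, the frame shown above can be rebuilt roughly as follows. This is only a sketch: the variable name extra_domestic_credit is taken from the call used further down, and the dtypes (integer Year, float Value) are assumptions, since only the printed output is available.

```python
import pandas as pd

# Rebuild the example frame, keeping the original row labels.
# Row 448 carries the deliberately wrong Year (2012).
extra_domestic_credit = pd.DataFrame(
    {
        "ISO3": ["FRA", "GBR", "USA"] * 4,
        "Country": ["France", "United Kingdom", "United States"] * 4,
        "Indicator": ["Domestic credit"] * 12,
        "Year": [2011, 2011, 2011,
                 2012, 2012, 2012,
                 2012, 2013, 2013,
                 2014, 2014, 2014],
        "Value": [54.68, 89.39, 93.10,
                  37.41, 58.50, 63.10,
                  36.03, 50.95, 63.40,
                  36.63, 54.47, 78.00],
    },
    index=[45, 140, 141, 217, 368, 369, 448, 599, 600, 679, 830, 831],
)

print(extra_domestic_credit)
```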
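I would like to convert it to the following format (this output was created with pivot_table, which copes with the duplicates but gives an incorrect result):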
Year   2011   2012   2013   2014
ISO3
FRA   54.68  36.72    NaN  36.63
GBR   89.39  58.50  50.95  54.47
USA   93.10  63.10  63.40  78.00
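For reference, a table like the one above can be reproduced along these lines. This is a minimal sketch that assumes pivot_table was called with its default aggfunc (the mean), which would explain why the two FRA rows for 2012 are averaged to 36.72 while 2013 is left as NaN. The frame is rebuilt here, without the Country and Indicator columns, only to keep the snippet self-contained.

```python
import pandas as pd

# Reduced copy of the example data (assumption: only these columns matter here).
extra_domestic_credit = pd.DataFrame(
    {
        "ISO3": ["FRA", "GBR", "USA"] * 4,
        "Year": [2011] * 3 + [2012] * 3 + [2012, 2013, 2013] + [2014] * 3,
        "Value": [54.68, 89.39, 93.10, 37.41, 58.50, 63.10,
                  36.03, 50.95, 63.40, 36.63, 54.47, 78.00],
    },
    index=[45, 140, 141, 217, 368, 369, 448, 599, 600, 679, 830, 831],
)

# pivot_table aggregates duplicate (ISO3, Year) pairs instead of raising;
# with the default aggfunc the duplicates are silently averaged.
wide = extra_domestic_credit.pivot_table(
    index="ISO3", columns="Year", values="Value"
)

print(wide.round(2))
```

The silent averaging is exactly what hides the data problem, which is why the result is not what I want.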
The call I would like to use is:
extra_domestic_credit.pivot(index = 'ISO3', columns = 'Year', values = 'Value')
but this results in:
----------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-86-745170bb5af5> in <module>
----> 1 extra_domestic_credit.pivot(index = 'ISO3', columns = 'Year', values = 'Value')
~\Anaconda3\envs\scipy18jlab\lib\site-packages\pandas\core\frame.py in pivot(self, index, columns, values)
5192 """
5193 from pandas.core.reshape.reshape import pivot
-> 5194 return pivot(self, index=index, columns=columns, values=values)
5195
5196 _shared_docs['pivot_table'] = """
~\Anaconda3\envs\scipy18jlab\lib\site-packages\pandas\core\reshape\reshape.py in pivot(self, index, columns, values)
413 indexed = self._constructor_sliced(self[values].values,
414 index=index)
--> 415 return indexed.unstack(columns)
416
417
~\Anaconda3\envs\scipy18jlab\lib\site-packages\pandas\core\series.py in unstack(self, level, fill_value)
2897 """
2898 from pandas.core.reshape.reshape import unstack
-> 2899 return unstack(self, level, fill_value)
2900
2901 # ----------------------------------------------------------------------
~\Anaconda3\envs\scipy18jlab\lib\site-packages\pandas\core\reshape\reshape.py in unstack(obj, level, fill_value)
499 unstacker = _Unstacker(obj.values, obj.index, level=level,
500 fill_value=fill_value,
--> 501 constructor=obj._constructor_expanddim)
502 return unstacker.get_result()
503
~\Anaconda3\envs\scipy18jlab\lib\site-packages\pandas\core\reshape\reshape.py in __init__(self, values, index, level, value_columns, fill_value, constructor)
135
136 self._make_sorted_values_labels()
--> 137 self._make_selectors()
138
139 def _make_sorted_values_labels(self):
~\Anaconda3\envs\scipy18jlab\lib\site-packages\pandas\core\reshape\reshape.py in _make_selectors(self)
173
174 if mask.sum() < len(self.index):
--> 175 raise ValueError('Index contains duplicate entries, '
176 'cannot reshape')
177
ValueError: Index contains duplicate entries, cannot reshape
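This is caused by the duplicated ISO3/Year combination in rows 217 and 448. This is a contrived example in which I introduced the error on purpose, but how can I find the entries that prevent the reshape without writing the df to Excel and investigating the data there?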
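To show the kind of in-notebook check I am after, here is a rough sketch. It assumes that DataFrame.duplicated (with subset and keep=False) and a count per group are reasonable stand-ins for the Excel count() check described above; I have not verified that this is the best approach, and the frame is again a reduced, rebuilt copy of the example data.

```python
import pandas as pd

# Reduced copy of the example data (assumption: only these columns matter here).
df = pd.DataFrame(
    {
        "ISO3": ["FRA", "GBR", "USA"] * 4,
        "Year": [2011] * 3 + [2012] * 3 + [2012, 2013, 2013] + [2014] * 3,
        "Value": [54.68, 89.39, 93.10, 37.41, 58.50, 63.10,
                  36.03, 50.95, 63.40, 36.63, 54.47, 78.00],
    },
    index=[45, 140, 141, 217, 368, 369, 448, 599, 600, 679, 830, 831],
)

key = ["ISO3", "Year"]  # the columns that pivot() uses as index and columns

# keep=False marks every member of a duplicated group, not just the repeats,
# so both offending rows are shown with their original row labels.
dupes = df[df.duplicated(subset=key, keep=False)]
print(dupes)

# Equivalent of the Excel count() check: number of rows per (ISO3, Year) pair.
counts = df.groupby(key).size()
print(counts[counts > 1])

# The same counts laid out like the target table; cells above 1 are the problem.
count_table = df.pivot_table(
    index="ISO3", columns="Year", values="Value", aggfunc="count", fill_value=0
)
print(count_table)
```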