I'm still pretty new to pandas and Python, so I worry I'm doing something silly here. That said, the closest existing question to my problem is How to create pivot with totals (margins) in Pandas?, so I'm asking anyway.
I have a simple dataframe with 3 columns:
Account ID Amount Close Date
0 10a 100 2009-01-01
1 10a 50 2009-01-01
2 10a 100 2010-04-01
3 10a 100 2011-04-01
4 10a 100 2012-05-01
.. ... ... ...
35 4b .5 2009-01-01
36 4c .5 2009-01-01
37 5a .5 2009-01-01
38 5b .5 2009-01-01
39 8a .5 2009-01-01
I think the problem is with the Close Date column. I suspect pandas doesn't know that one 2009-01-01 is equal to another 2009-01-01.
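If Close Date is stored as strings (or a mix of strings and timestamps), equal-looking dates really won't compare equal for grouping. Converting the column with `pd.to_datetime` is one way to rule that out. A minimal sketch, assuming a frame like the one above (the real one in the question is called `opps`):

```python
import pandas as pd

# Hypothetical frame mimicking the data above
opps = pd.DataFrame({
    'Account ID': ['10a', '10a', '10a'],
    'Amount': [100, 50, 100],
    'Close Date': ['2009-01-01', '2009-01-01', '2010-04-01'],  # strings here
})

# Normalize to real timestamps so equal dates compare (and group) equal
opps['Close Date'] = pd.to_datetime(opps['Close Date'])
print(opps['Close Date'].dtype)
```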
I want to pivot this table to get output like the following, grouped first by Account ID and then by Close Date. If an Account ID has multiple rows with the same Close Date, I'd like those Amounts summed in the values column, like this. (For the record, I really only care about the year, but I've been trying to keep things simple while writing up the question.)
Account ID Close Date
2c 2009-01-01 100
2011-01-01 100
10a 2009-01-01 150
2010-04-01 100
...
I've tried all kinds of things and keep running into problems that suggest some kind of date issue. Maybe I need to import a different library?
Here's my most recent attempt:
pd.pivot_table(opps, index=['Account ID'], columns='Close Date', values=['Amount'], aggfunc=np.sum)
and the output is very close to what I want.
The only problem is that for any Account ID with two rows for the same date, that data simply disappears from the output. Account 10a has 3 rows for 2009-01-01, but the pivot table shows NaN for 2009-01-01.
I thought I'd try the same pivot table with margins=True.
When I did, I got this error message:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-182-f8dc0d75c868> in <module>()
3 margins = "True",
4 values=['Amount'],
----> 5 aggfunc=np.sum)
/Applications/anaconda/lib/python2.7/site-packages/pandas/tools/pivot.pyc in pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna)
141 if margins:
142 table = _add_margins(table, data, values, rows=index,
--> 143 cols=columns, aggfunc=aggfunc)
144
145 # discard the top level
/Applications/anaconda/lib/python2.7/site-packages/pandas/tools/pivot.pyc in _add_margins(table, data, values, rows, cols, aggfunc)
167
168 if values:
--> 169 marginal_result_set = _generate_marginal_results(table, data, values, rows, cols, aggfunc, grand_margin)
170 if not isinstance(marginal_result_set, tuple):
171 return marginal_result_set
/Applications/anaconda/lib/python2.7/site-packages/pandas/tools/pivot.pyc in _generate_marginal_results(table, data, values, rows, cols, aggfunc, grand_margin)
236 # we are going to mutate this, so need to copy!
237 piece = piece.copy()
--> 238 piece[all_key] = margin[key]
239
240 table_pieces.append(piece)
/Applications/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
1795 return self._getitem_multilevel(key)
1796 else:
-> 1797 return self._getitem_column(key)
1798
1799 def _getitem_column(self, key):
/Applications/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_column(self, key)
1802 # get column
1803 if self.columns.is_unique:
-> 1804 return self._get_item_cache(key)
1805
1806 # duplicate columns & possible reduce dimensionaility
/Applications/anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
1082 res = cache.get(item)
1083 if res is None:
-> 1084 values = self._data.get(item)
1085 res = self._box_item_values(item, values)
1086 cache[item] = res
/Applications/anaconda/lib/python2.7/site-packages/pandas/core/internals.pyc in get(self, item, fastpath)
2849
2850 if not isnull(item):
-> 2851 loc = self.items.get_loc(item)
2852 else:
2853 indexer = np.arange(len(self.items))[isnull(self.items)]
/Applications/anaconda/lib/python2.7/site-packages/pandas/core/index.pyc in get_loc(self, key, method)
1570 """
1571 if method is None:
-> 1572 return self._engine.get_loc(_values_from_object(key))
1573
1574 indexer = self.get_indexer([key], method=method)
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3824)()
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3704)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12280)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12231)()
KeyError: Timestamp('2009-01-01 00:00:00')
Thanks for any advice.
Answer 0 (score: 0)
This sounds like a groupby rather than a pivot table to me - your columns are fixed.
For example:
import pandas as pd
from datetime import date
df = pd.DataFrame(data=[['10a', 100, date(2009, 1, 1)],
                        ['10a', 50, date(2009, 1, 1)],
                        ['10a', 100, date(2010, 4, 1)],
                        ['10a', 100, date(2011, 4, 1)],
                        ['10a', 100, date(2012, 5, 1)],
                        ['4b', .5, date(2009, 1, 1)],
                        ['4c', .5, date(2009, 1, 1)],
                        ['5a', .5, date(2009, 1, 1)],
                        ['5b', .5, date(2009, 1, 1)],
                        ['8a', .5, date(2009, 1, 1)]],
                  columns=['Account ID', 'Amount', 'Close Date'])
df.groupby(['Account ID', 'Close Date']).sum()
which gives:
Amount
Account ID Close Date
10a 2009-01-01 150.0
2010-04-01 100.0
2011-04-01 100.0
2012-05-01 100.0
4b 2009-01-01 0.5
4c 2009-01-01 0.5
5a 2009-01-01 0.5
5b 2009-01-01 0.5
8a 2009-01-01 0.5
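If a flat frame is preferred over the MultiIndexed result, `as_index=False` (or a trailing `reset_index()`) is a common follow-up. A small sketch, reusing a subset of the example data above:

```python
import pandas as pd
from datetime import date

df = pd.DataFrame(data=[['10a', 100, date(2009, 1, 1)],
                        ['10a', 50, date(2009, 1, 1)],
                        ['4b', .5, date(2009, 1, 1)]],
                  columns=['Account ID', 'Amount', 'Close Date'])

# Same aggregation, but Account ID / Close Date stay as ordinary columns
flat = df.groupby(['Account ID', 'Close Date'], as_index=False)['Amount'].sum()
print(flat)
```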
Apologies if I've missed something.
The equivalent with a pivot table (assuming numpy is imported as np) is:
df.pivot_table(index=['Account ID', 'Close Date'], values=['Amount'], aggfunc=np.sum)
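Since the question mentions only really caring about the year, one possible variation (a sketch, not necessarily what the asker intends) is to group on the calendar year extracted with the `.dt` accessor:

```python
import pandas as pd
from datetime import date

df = pd.DataFrame(data=[['10a', 100, date(2009, 1, 1)],
                        ['10a', 50, date(2009, 1, 1)],
                        ['10a', 100, date(2010, 4, 1)],
                        ['4b', .5, date(2009, 1, 1)]],
                  columns=['Account ID', 'Amount', 'Close Date'])

# Ensure real datetimes, then group by account and year instead of full date
df['Close Date'] = pd.to_datetime(df['Close Date'])
by_year = df.groupby(['Account ID', df['Close Date'].dt.year])['Amount'].sum()
```

Here `by_year` is a Series indexed by (Account ID, year), so amounts for different days within the same year roll up together.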