Pandas: shaping data for covariance

Time: 2013-05-12 23:12:32

Tags: python pandas time-series covariance

I need to run a simple covariance analysis on time series. My raw data is shaped like this:

WEEK_END_DATE              TITLE_SHORT          SALES  
2012-02-25 00:00:00.000000 "Bob" (EBK)         1
                           "Bob" (EBK)         1
2012-03-31 00:00:00.000000 "Bob" (EBK)         1
                           "Bob" (EBK)         1
2012-03-03 00:00:00.000000 "Sally" (EBK)          1
2012-03-10 00:00:00.000000 "Sally" (EBK)          1
2012-03-17 00:00:00.000000 "Sally" (EBK)          1
                           "Sally" (EBK)          1
2012-04-07 00:00:00.000000 "Sally" (EBK)          1

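As you can see, there are some duplicates. Unless I'm missing something, I need this data as a set of vectors, one per title, so that I can use numpy.cov.

Question:

How do I find the duplicates by date and title and combine them by summing? I've been trying to use pandas groupby on WEEK_END_DATE and TITLE_SHORT, but it indexes the result in a way I don't understand.

Edit: Specifically, when I try df.groupby(["WEEK_END_DATE", "TITLE_SHORT"]), I get this: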

>df.ix[0:3]

WEEK_END_DATE               TITLE_SHORT               
2012-02-04 00:00:00.000000  'SALEM'S LOT (EBK)            <pandas.core.indexing._NDFrameIndexer object a...
                            'TIS THE SEASON! (EBK)        <pandas.core.indexing._NDFrameIndexer object a...
                            (NOT THAT YOU ASKED) (EBK)    <pandas.core.indexing._NDFrameIndexer object a...
dtype: object

And trying to select df.ix[1,] raises this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pandas-0.11.0rc1_20130415-py2.7-macosx-10.8-intel.egg/pandas/core/series.py", line 613, in __getitem__
    return self.index.get_value(self, key)
  File "/Library/Python/2.7/site-packages/pandas-0.11.0rc1_20130415-py2.7-macosx-10.8-intel.egg/pandas/core/index.py", line 1630, in get_value
    loc = self.get_loc(key)
  File "/Library/Python/2.7/site-packages/pandas-0.11.0rc1_20130415-py2.7-macosx-10.8-intel.egg/pandas/core/index.py", line 2285, in get_loc
    result = slice(*self.slice_locs(key, key))
  File "/Library/Python/2.7/site-packages/pandas-0.11.0rc1_20130415-py2.7-macosx-10.8-intel.egg/pandas/core/index.py", line 2226, in slice_locs
    start_slice = self._partial_tup_index(start, side='left')
  File "/Library/Python/2.7/site-packages/pandas-0.11.0rc1_20130415-py2.7-macosx-10.8-intel.egg/pandas/core/index.py", line 2250, in _partial_tup_index
    raise Exception('Level type mismatch: %s' % lab)
Exception: Level type mismatch: 3

1 Answer:

Answer 0: (score: 2)

I'm not entirely sure I understand what's going on, but here's how I'd start. First, read the data in (it looks fixed-width to me):

>>> df = pd.read_fwf("weekend.dat", widths=(26, 20, 9), parse_dates=[0])
>>> df = df.fillna(method="ffill")
>>> df
        WEEK_END_DATE    TITLE_SHORT  SALES
0 2012-02-25 00:00:00    "Bob" (EBK)      1
1 2012-02-25 00:00:00    "Bob" (EBK)      1
2 2012-03-31 00:00:00    "Bob" (EBK)      1
3 2012-03-31 00:00:00    "Bob" (EBK)      1
4 2012-03-03 00:00:00  "Sally" (EBK)      1
5 2012-03-10 00:00:00  "Sally" (EBK)      1
6 2012-03-17 00:00:00  "Sally" (EBK)      1
7 2012-03-17 00:00:00  "Sally" (EBK)      1
8 2012-04-07 00:00:00  "Sally" (EBK)      1

Then aggregate the duplicates:

>>> g = df.groupby(["WEEK_END_DATE", "TITLE_SHORT"]).sum().reset_index()
>>> g
        WEEK_END_DATE    TITLE_SHORT  SALES
0 2012-02-25 00:00:00    "Bob" (EBK)      2
1 2012-03-03 00:00:00  "Sally" (EBK)      1
2 2012-03-10 00:00:00  "Sally" (EBK)      1
3 2012-03-17 00:00:00  "Sally" (EBK)      2
4 2012-03-31 00:00:00    "Bob" (EBK)      2
5 2012-04-07 00:00:00  "Sally" (EBK)      1
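
The indexing confusion in the question may come from the fact that groupby on its own only returns a lazy GroupBy object; nothing is summed until an aggregation is called. Here is a minimal sketch of the difference, using a hypothetical three-row frame rather than the real file. The variable names (grouped, summed, flat, bob_sales) are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the question's data: one duplicated (date, title) pair.
df = pd.DataFrame({
    "WEEK_END_DATE": pd.to_datetime(["2012-02-25", "2012-02-25", "2012-03-03"]),
    "TITLE_SHORT": ['"Bob" (EBK)', '"Bob" (EBK)', '"Sally" (EBK)'],
    "SALES": [1, 1, 1],
})

# groupby alone only describes the grouping; no sums exist yet.
grouped = df.groupby(["WEEK_END_DATE", "TITLE_SHORT"])

# Aggregating gives a frame indexed by a (date, title) MultiIndex.
summed = grouped.sum()

# as_index=False keeps the keys as ordinary columns, like reset_index() does.
flat = df.groupby(["WEEK_END_DATE", "TITLE_SHORT"], as_index=False)["SALES"].sum()

# A row of the MultiIndexed result is selected with a (date, title) tuple.
bob_sales = summed.loc[(pd.Timestamp("2012-02-25"), '"Bob" (EBK)'), "SALES"]

print(summed)
print(flat)
```

Whether to keep the MultiIndex or flatten it is mostly a matter of taste; the flat form is easier to inspect, while the MultiIndex form is convenient for reshaping.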

Then do whatever cov work you need (note that cov is also a Series / DataFrame / GroupBy method, so you don't need to call np.cov specifically).
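
One way the last step might look, as a sketch rather than the only approach: reshape the summed data so each title becomes a column (one vector per title, as the question asks), then call DataFrame.cov. The sample rows are copied from the question. Filling missing weeks with 0 is my assumption, namely that no row for a title in a given week means no sales; if that is not true for your data, skip the fillna and cov will use only the weeks both titles share.

```python
import pandas as pd

# Sample rows from the question, with the blank dates already forward-filled.
df = pd.DataFrame({
    "WEEK_END_DATE": pd.to_datetime([
        "2012-02-25", "2012-02-25", "2012-03-31", "2012-03-31",
        "2012-03-03", "2012-03-10", "2012-03-17", "2012-03-17", "2012-04-07",
    ]),
    "TITLE_SHORT": ['"Bob" (EBK)'] * 4 + ['"Sally" (EBK)'] * 5,
    "SALES": [1] * 9,
})

# Sum duplicate (date, title) rows, then pivot titles into columns.
weekly = df.groupby(["WEEK_END_DATE", "TITLE_SHORT"])["SALES"].sum()
wide = weekly.unstack("TITLE_SHORT")

# Assumption: a week with no row for a title means zero sales that week.
wide = wide.fillna(0)

# Pairwise covariance between the per-title weekly sales columns.
cov_matrix = wide.cov()

print(wide)
print(cov_matrix)
```

The result is a titles-by-titles matrix, so with many titles you can look up any pair by label, e.g. cov_matrix.loc[title_a, title_b].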