Group by multiple columns and convert the result to a DataFrame / array

Time: 2017-10-29 22:36:30

Tags: python arrays pandas dataframe group-by

Hello, I have a DataFrame like this:



                       Value         day  hour  min
Time                                                         
2015-12-19 10:08:52     1805  2015-12-19    10    8
2015-12-19 10:09:52     1794  2015-12-19    10    9
2015-12-19 10:19:51     1796  2015-12-19    10   19
2015-12-19 10:20:51     1806  2015-12-19    10   20
2015-12-19 10:29:52     1802  2015-12-19    10   29
2015-12-19 10:30:52     1800  2015-12-19    10   30
2015-12-19 10:40:51     1804  2015-12-19    10   40
2015-12-19 10:41:51     1798  2015-12-19    10   41
2015-12-19 10:50:51     1790  2015-12-19    10   50
2015-12-19 10:51:52     1811  2015-12-19    10   51
2015-12-19 11:00:51     1803  2015-12-19    11    0
2015-12-19 11:01:52     1784  2015-12-19    11    1
                         ...    ...         ...   ...  ...
2016-07-15 17:30:13     1811  2016-07-15    17   30
2016-07-15 17:31:13     1787  2016-07-15    17   31
2016-07-15 17:41:13     1800  2016-07-15    17   41
2016-07-15 17:42:13     1795  2016-07-15    17   42




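I want to group it by day and hour, and end up with "Value" as a multidimensional array, as a column like the one below.

After grouping by day and hour, I need to get something like this for each hour: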



2015-12-19  10 [1805, 1794, 1796, 1806, 1802, 1800, 1804, 179...  ]
2015-12-20  11 [1803, 1793, 1795, 1801, 1796, 1796, 1788, 180...  ]
...  
2016-07-15  17 [1794, 1792, 1788, 1799, 1811, 1803, 1808, 179... ]




In the end, I would like to have a DataFrame like this:



Time_index  hour    value1 value2 value3 ........value20

2015-12-19  10    1805, 1794, 1796, 1806 ... 1804, 1791, 1788, 1812  
2015-12-20  11    1803, 1793, 1795, 1801 ... 1796, 1796, 1788, 1800 
...  
2016-07-15  17    1794, 1792, 1788, 1799 ... 1811, 1803, 1808, 1790




Or an array like this:



[[1805, 1794, 1796, 1806, 1802, 1800, 1804, 179...  ],[1803, 1793, 1795, 1801, 1796, 1796, 1788, 180...  ]....[1794, 1792, 1788, 1799, 1811, 1803, 1808, 179... ]]




I was able to get groupby working with a single column:



grouped_0 = train_df.groupby(['day'])
grouped = grouped_0.aggregate(lambda x: list(x))
grouped['grouped'] = grouped['Value']




The 'grouped' column of the resulting DataFrame looks like this:



2015-12-19  [1805, 1794, 1796, 1806, 1802, 1800, 1804, 179...  
2015-12-20  [1790, 1809, 1809, 1789, 1807, 1804, 1790, 179...  
2015-12-21  [1794, 1792, 1788, 1799, 1811, 1803, 1808, 179...  
2015-12-22  [1815, 1812, 1798, 1808, 1802, 1788, 1808, 179...  
2015-12-23  [1803, 1800, 1799, 1803, 1802, 1804, 1788, 179...  
2015-12-24  [1803, 1795, 1801, 1798, 1799, 1802, 1799, 179...




However, when I try this:



grouped_0 = train_df.groupby(['day', 'hour'])
grouped = grouped_0.aggregate(lambda x: list(x))
grouped['grouped'] = grouped['Value']




It throws this error:



Traceback (most recent call last):
  File "<input>", line 3, in <module>
  File "C:\Apps\Continuum\Anaconda2\envs\python36\lib\site-packages\pandas\core\groupby.py", line 4036, in aggregate
    return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
  File "C:\Apps\Continuum\Anaconda2\envs\python36\lib\site-packages\pandas\core\groupby.py", line 3476, in aggregate
    return self._python_agg_general(arg, *args, **kwargs)
  File "C:\Apps\Continuum\Anaconda2\envs\python36\lib\site-packages\pandas\core\groupby.py", line 848, in _python_agg_general
    result, counts = self.grouper.agg_series(obj, f)
  File "C:\Apps\Continuum\Anaconda2\envs\python36\lib\site-packages\pandas\core\groupby.py", line 2180, in agg_series
    return self._aggregate_series_pure_python(obj, func)
  File "C:\Apps\Continuum\Anaconda2\envs\python36\lib\site-packages\pandas\core\groupby.py", line 2215, in _aggregate_series_pure_python
    raise ValueError('Function does not reduce')
ValueError: Function does not reduce

My pandas version: pd.__version__ is '0.20.3'

1 answer:

Answer 0: (score: 1)

Yes, using agg this way is not the best idea, because unless the result is a container with a single object, the result is considered invalid.

You can use groupby + apply instead:

g = df.groupby(['day', 'hour']).Value.apply(lambda x: x.values.tolist())
g

day         hour
2015-12-19  10      [1805, 1794, 1796, 1806, 1802, 1800, 1804, 179...
            11                                           [1803, 1784]
2016-07-15  17                               [1811, 1787, 1800, 1795]
Name: Value, dtype: object
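
To try this without the original data, here is a minimal self-contained sketch. The five-row frame is a made-up sample shaped like the one in the question (values copied from its first rows), not the asker's real `train_df`.

```python
import pandas as pd

# Hypothetical sample: a few rows shaped like the question's DataFrame.
df = pd.DataFrame(
    {
        "Value": [1805, 1794, 1796, 1803, 1784],
        "day": ["2015-12-19"] * 5,
        "hour": [10, 10, 10, 11, 11],
    },
    index=pd.to_datetime(
        [
            "2015-12-19 10:08:52",
            "2015-12-19 10:09:52",
            "2015-12-19 10:19:51",
            "2015-12-19 11:00:51",
            "2015-12-19 11:01:52",
        ]
    ),
)
df.index.name = "Time"

# Select the single column first, then collect each (day, hour) group's
# values into a plain Python list.
g = df.groupby(["day", "hour"])["Value"].apply(lambda x: x.values.tolist())
print(g)
```

The result is a Series indexed by (day, hour) whose elements are lists, so one group's values can be read with `g.loc[("2015-12-19", 10)]`.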

If you want each element in its own column, you can do this:

v = pd.DataFrame(g.values.tolist(), index=g.index)\
       .rename(columns=lambda x: 'value{}'.format(x + 1)).reset_index()

v is your final result.
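
For reference, a rough end-to-end sketch covering both requested outputs (the wide DataFrame and the nested array), again on a made-up sample rather than the real data. Note the assumption that hours may hold different numbers of readings: in that case the shorter rows are padded with NaN in the wide frame, while the nested list stays ragged.

```python
import pandas as pd

# Hypothetical sample: hour 10 has three readings, hour 11 has two.
df = pd.DataFrame(
    {
        "Value": [1805, 1794, 1796, 1803, 1784],
        "day": ["2015-12-19"] * 5,
        "hour": [10, 10, 10, 11, 11],
    }
)

# One list of values per (day, hour) group.
g = df.groupby(["day", "hour"])["Value"].apply(lambda x: x.values.tolist())

# Wide form: one column per position within the hour (value1, value2, ...).
v = (
    pd.DataFrame(g.values.tolist(), index=g.index)
    .rename(columns=lambda x: "value{}".format(x + 1))
    .reset_index()
)

# Array form: a plain list of lists, one inner list per group.
nested = g.tolist()

print(v)
print(nested)
```

If every hour is guaranteed to have the same number of readings, `nested` can be passed to `numpy.array` to get a regular 2-D array; with ragged groups the list-of-lists form is the safer choice.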