Question

I have a pandas DataFrame as follows:

date                | Item   | count
------------------------------------
2016-12-06 10:45:08 |  Item1 |  60
2016-12-06 10:45:08 |  Item2 |  145
2016-12-06 09:45:00 |  Item1 |  60
2016-12-06 09:44:54 |  Item3 |  600
2016-12-06 09:44:48 |  Item4 |  15
2016-12-06 11:45:08 |  Item1 |  60
2016-12-06 10:45:08 |  Item2 |  14
2016-11-06 09:45:00 |  Item1 |  62
2016-11-06 09:44:54 |  Item3 |  6
2016-11-06 09:44:48 |  Item4 |  15

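I want to group the items by, say, hour of the day (or later by day) to get the following statistics: the list of items sold in each period, for example:

On 2016-12-06, from 09:00:00 to 10:00:00, Item1, Item3 and Item4 were sold; and so on.
On 2016-12-06, Item1, Item2, Item3 and Item4 (unique items) were sold.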

I am still far from getting these statistics, because I am stuck at grouping by time. Initially, print df.dtypes shows

date    object
Item    object
count   int64
dtype: object

So I converted the date column to pandas datetime objects with the following line.

df['date'] = pd.to_datetime(df['date'])

Now, print df.dtypes gives:

date    datetime64[ns]
Item    object
count   int64
dtype: object
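
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the frame above and applies the same conversion. The values are copied from the table; the variable name rows is only for illustration, and print is written in Python 3 style.

```python
import pandas as pd

# Sample rows copied from the table in the question.
rows = [
    ("2016-12-06 10:45:08", "Item1", 60),
    ("2016-12-06 10:45:08", "Item2", 145),
    ("2016-12-06 09:45:00", "Item1", 60),
    ("2016-12-06 09:44:54", "Item3", 600),
    ("2016-12-06 09:44:48", "Item4", 15),
    ("2016-12-06 11:45:08", "Item1", 60),
    ("2016-12-06 10:45:08", "Item2", 14),
    ("2016-11-06 09:45:00", "Item1", 62),
    ("2016-11-06 09:44:54", "Item3", 6),
    ("2016-11-06 09:44:48", "Item4", 15),
]
df = pd.DataFrame(rows, columns=["date", "Item", "count"])

# Convert the string column to datetimes; the index stays a RangeIndex.
df["date"] = pd.to_datetime(df["date"])
print(df.dtypes)
```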

However, when I try to group by the date column using TimeGrouper with the following lines

from pandas.tseries.resample import TimeGrouper 
print df.groupby([df['date'],pd.TimeGrouper(freq='Min')])

I get the following TypeError. According to suggestions I found in other answers, converting with pd.to_datetime should have resolved this issue.

TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'

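I don't know how to resolve this so that I can move on to the statistics I am after. Any hints on fixing this error and using TimeGrouper to get the statistics in dictionary format (or any more meaningful form) would be much appreciated.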

Answer 1

You can groupby a numpy array: the datetimes with minutes and seconds removed:

print (df['date'].values.astype('<M8[h]'))
['2016-12-06T10' '2016-12-06T10' '2016-12-06T09' '2016-12-06T09'
 '2016-12-06T09' '2016-12-06T11' '2016-12-06T10' '2016-11-06T09'
 '2016-11-06T09' '2016-11-06T09']

print (df.groupby(df['date'].values.astype('<M8[h]')).Item.unique())
2016-11-06 09:00:00    [Item1, Item3, Item4]
2016-12-06 09:00:00    [Item1, Item3, Item4]
2016-12-06 10:00:00           [Item1, Item2]
2016-12-06 11:00:00                  [Item1]
Name: Item, dtype: object

print (df.groupby(df['date'].values.astype('<M8[h]')).Item
         .apply(lambda x: x.unique().tolist()).to_dict())
{Timestamp('2016-11-06 09:00:00'): ['Item1', 'Item3', 'Item4'], 
 Timestamp('2016-12-06 09:00:00'): ['Item1', 'Item3', 'Item4'], 
 Timestamp('2016-12-06 10:00:00'): ['Item1', 'Item2'], 
 Timestamp('2016-12-06 11:00:00'): ['Item1']}

print (df.groupby(df['date'].values.astype('<M8[D]')).Item
         .apply(lambda x: x.unique().tolist()).to_dict())
{Timestamp('2016-11-06 00:00:00'): ['Item1', 'Item3', 'Item4'], 
 Timestamp('2016-12-06 00:00:00'): ['Item1', 'Item2', 'Item3', 'Item4']}

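Thanks to Jeff for suggesting round. Note that round goes to the nearest hour rather than truncating, so the bucket labels differ from the ones above (09:45:00 lands in 10:00:00):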

print (df.groupby(df['date'].dt.round('h')).Item
         .apply(lambda x: x.unique().tolist()).to_dict())

{Timestamp('2016-11-06 10:00:00'): ['Item1', 'Item3', 'Item4'], 
 Timestamp('2016-12-06 12:00:00'): ['Item1'], 
 Timestamp('2016-12-06 10:00:00'): ['Item1', 'Item3', 'Item4'], 
 Timestamp('2016-12-06 11:00:00'): ['Item1', 'Item2']}

print (df.groupby(df['date'].dt.round('d')).Item
         .apply(lambda x: x.unique().tolist()).to_dict())
{Timestamp('2016-11-06 00:00:00'): ['Item1', 'Item3', 'Item4'], 
 Timestamp('2016-12-06 00:00:00'): ['Item1', 'Item2', 'Item3', 'Item4']}
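
If you want the truncating behaviour of the numpy approach but prefer to stay with the .dt accessor, dt.floor may be a closer fit than round. A minimal sketch, assuming the same data as in the question (the frame is rebuilt here so the snippet runs on its own; by_hour is just an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2016-12-06 10:45:08", "2016-12-06 10:45:08", "2016-12-06 09:45:00",
        "2016-12-06 09:44:54", "2016-12-06 09:44:48", "2016-12-06 11:45:08",
        "2016-12-06 10:45:08", "2016-11-06 09:45:00", "2016-11-06 09:44:54",
        "2016-11-06 09:44:48",
    ]),
    "Item": ["Item1", "Item2", "Item1", "Item3", "Item4",
             "Item1", "Item2", "Item1", "Item3", "Item4"],
})

# floor truncates each timestamp to the start of its hour,
# so 09:45:00 stays in the 09:00:00 bucket.
by_hour = (df.groupby(df["date"].dt.floor("h"))["Item"]
             .apply(lambda x: x.unique().tolist())
             .to_dict())
print(by_hour)
```

Using "D" instead of "h" should give the per-day grouping in the same way.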

Answer 2

sold = df.set_index('date').Item.resample('H').agg({'Sold': 'unique'})
sold[sold.Sold.str.len() > 0]

                                      Sold
date                                      
2016-11-06 09:00:00  [Item4, Item3, Item1]
2016-12-06 09:00:00  [Item4, Item3, Item1]
2016-12-06 10:00:00         [Item1, Item2]
2016-12-06 11:00:00                [Item1]
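
On the TypeError in the question: a time-based grouper without a key works on the frame's index, and pd.to_datetime only converted the column, so the index was still a RangeIndex. Setting date as the index, as above, is one way around it. Another option is to pass key='date' to pd.Grouper, which as far as I know is what replaced TimeGrouper in recent pandas versions. A minimal sketch under those assumptions; the sorted() call and the length filter are my additions to make the order deterministic and to drop empty hourly bins, and the names hourly and sold_by_hour are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2016-12-06 10:45:08", "2016-12-06 10:45:08", "2016-12-06 09:45:00",
        "2016-12-06 09:44:54", "2016-12-06 09:44:48", "2016-12-06 11:45:08",
        "2016-12-06 10:45:08", "2016-11-06 09:45:00", "2016-11-06 09:44:54",
        "2016-11-06 09:44:48",
    ]),
    "Item": ["Item1", "Item2", "Item1", "Item3", "Item4",
             "Item1", "Item2", "Item1", "Item3", "Item4"],
})

# key="date" tells the grouper to bin on the column instead of the index.
hourly = (df.groupby(pd.Grouper(key="date", freq="h"))["Item"]
            .apply(lambda s: sorted(s.unique())))

# Hours with no sales between the first and last timestamp may appear
# as empty lists; keep only the hours where something was sold.
hourly = hourly[hourly.str.len() > 0]

sold_by_hour = hourly.to_dict()
print(sold_by_hour)
```

Changing freq to "D" should give one entry per day instead of per hour.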

Pandas: group by hour of day into a dictionary
