Question

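I have data records for different entities, and for each entity a count is recorded at specific hours of the day, over a whole month. For example: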

     entity_id    time              counts
0      175  2019-03-01 05:00:00       3
1      175  2019-03-01 06:00:00       4
2      175  2019-03-01 07:00:00       6
3      175  2019-03-01 08:00:00       6
4      175  2019-03-01 09:00:00       7
5      178  2019-03-01 05:00:00       8
6      178  2019-03-01 06:00:00       4
7      178  2019-03-01 07:00:00       5
8      178  2019-03-01 08:00:00       6
9      200  2019-03-01 05:00:00       7
10     200  2019-03-01 08:00:00       3
11     175  2019-03-03 05:00:00       3
12     175  2019-03-03 07:00:00       6
13     175  2019-03-03 08:00:00       6
14     175  2019-03-03 09:00:00       7
15     178  2019-03-03 05:00:00       8
16     178  2019-03-03 06:00:00       4
17     178  2019-03-03 07:00:00       5
18     178  2019-03-03 08:00:00       6
19     200  2019-03-03 05:00:00       7
20     200  2019-03-03 08:00:00       3
21     200  2019-03-03 09:00:00       7
...

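For each entity, I would like to aggregate the mean of the counts over several hour ranges on different days of the week, across the month. For example: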

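Mean for Sunday morning (6-10 AM)
Mean for Sunday through Thursday morning (6-10 AM)
Mean for Sunday through Thursday noon (11 AM-1 PM)
Mean for Friday through Saturday noon (11 AM-1 PM)
Mean for Friday evening (6 PM-9 PM)
etc.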

So I would like to end up with a df like this (partial example):

     entity_id day_in_week time_in_day counts_mean
0      175     sun         eve         5
1      175     sun-thu     noon        6
2      178     sun         eve         5
3      178     sat         eve         5
4      200     sun-thu     morning     2
...

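I managed to get part of the way there by iterating over the data, slicing, and extracting the different elements, but I assume there is a much more efficient way.

I started from this issue, but I still have too many for loops. Any ideas on how to optimize performance?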

Answer 1

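If your time column holds datetime values in pandas, you can use the datetime accessor (.dt) to create new columns.

You can proceed as follows:

1. Create a column that represents day_in_week:

df["day_in_week"] = df["time"].dt.dayofweek

2. Use a simple .apply function to create a column that classifies each time into a period such as morning or evening, by comparing the time inside the function, according to your requirements.
3. Create another column, based on the two columns created above, that indicates your combination.
4. Use groupby on that column to get the grouped data or metrics.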

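I know the process is a bit long, but it has no explicit for loops; it uses df.apply and the datetime properties that pandas already provides, plus a few if-else conditions (according to your requirements).

Steps 2, 3, and 4 depend entirely on the data; since I don't have the data, I can't write the exact code. I have tried to explain the approach that can be used.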
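
As a rough illustration of the four steps above, here is a minimal sketch on a few made-up rows. The hour boundaries (half-open windows), the helper name time_of_day, the day_group mapping, and the sample values are all assumptions for the example, not something given in the answer.

```python
import pandas as pd

# Made-up sample rows (2019-03-01 is a Friday, 2019-03-03 is a Sunday).
df = pd.DataFrame({
    "entity_id": [175, 175, 175, 178, 178],
    "time": pd.to_datetime([
        "2019-03-01 07:00:00",
        "2019-03-03 05:00:00",
        "2019-03-03 07:00:00",
        "2019-03-03 08:00:00",
        "2019-03-03 12:00:00",
    ]),
    "counts": [6, 3, 6, 4, 8],
})

# Step 1: day of the week as an integer (Monday=0 ... Sunday=6).
df["day_in_week"] = df["time"].dt.dayofweek

# Step 2: label the time of day; hours outside every window get None.
def time_of_day(ts):
    if 6 <= ts.hour < 10:
        return "morning"
    if 11 <= ts.hour < 13:
        return "noon"
    if 18 <= ts.hour < 21:
        return "eve"
    return None

df["time_in_day"] = df["time"].apply(time_of_day)

# Step 3: map each day to one (non-overlapping) day group.
day_group = {6: "sun-thu", 0: "sun-thu", 1: "sun-thu", 2: "sun-thu",
             3: "sun-thu", 4: "fri-sat", 5: "fri-sat"}
df["day_group"] = df["day_in_week"].map(day_group)

# Step 4: drop unlabeled rows, then average the counts per group.
result = (df.dropna(subset=["time_in_day"])
            .groupby(["entity_id", "day_group", "time_in_day"], as_index=False)["counts"]
            .mean()
            .rename(columns={"counts": "counts_mean"}))
print(result)
```

Because each row receives exactly one day_group label, this sketch cannot report overlapping groups such as sun and sun-thu at the same time; those would need one pass per group.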

I hope this helps.

Answer 2

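My solution is based on an auxiliary DataFrame that defines the ranges over which the means are to be computed (day_in_week, time_in_day, and a CustomBusinessHour for these attributes).

Creating this DataFrame (I called it calendars) starts with the day_in_week and time_in_day columns: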

import pandas as pd

calendars = pd.DataFrame([
    ['sun',     'morning'],
    ['sun-thu', 'morning'],
    ['sun-thu', 'noon'],
    ['fri-sat', 'noon'],
    ['fri',     'eve']],
    columns=['day_in_week', 'time_in_day'])

If you want more such definitions, add them there.

Then add the corresponding CustomBusinessHour objects:

Define a function to get the hour limits:

def getHourLimits(name):
    if name == 'morning':
        return '06:00', '10:00'
    elif name == 'noon':
        return '11:00', '13:00'
    elif name == 'eve':
        return '18:00', '21:00'
    else:
        return '8:00', '16:00'

Define a function to get the week mask (from the start day to the end day):

def getWeekMask(name):
    parts = name.split('-')
    if len(parts) > 1:
        fullWeek = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']
        ind1 = fullWeek.index(parts[0].capitalize())
        ind2 = fullWeek.index(parts[1].capitalize())
        return ' '.join(fullWeek[ind1 : ind2 + 1])
    else:
        return parts[0].capitalize()
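
To make the week mask concrete, here is a small self-contained sketch. week_mask is a hypothetical, condensed equivalent of getWeekMask above (same slicing logic), written only for this illustration. It prints the mask for each label used in calendars, plus one range that wraps past Saturday.

```python
FULL_WEEK = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']

def week_mask(name):
    """Hypothetical condensed equivalent of getWeekMask (same slicing logic)."""
    parts = [p.capitalize() for p in name.split('-')]
    if len(parts) == 1:
        return parts[0]
    first, last = FULL_WEEK.index(parts[0]), FULL_WEEK.index(parts[1])
    # A plain slice: the range must not wrap around the end of the week.
    return ' '.join(FULL_WEEK[first:last + 1])

for label in ['sun', 'sun-thu', 'fri-sat', 'sat-sun']:
    print(repr(label), '->', repr(week_mask(label)))
```

With this slicing logic, a wrapping range such as 'sat-sun' yields an empty mask, so ranges should be written within the Sun..Sat order.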

Define a function that generates a CustomBusinessHour object:

def getCBH(row):
    wkMask = getWeekMask(row.day_in_week)
    hStart, hEnd = getHourLimits(row.time_in_day)
    return pd.offsets.CustomBusinessHour(weekmask=wkMask, start=hStart, end=hEnd)

Add the CustomBusinessHour objects to calendars:

calendars['CBH'] = calendars.apply(getCBH, axis=1)
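
To see what one of these objects does on its own, the following sketch builds a single CustomBusinessHour directly (the 'sun' / 'morning' case) and tests three timestamps with is_on_offset. The variable names and the chosen timestamps are assumptions for the example.

```python
import pandas as pd

# Sunday only, 06:00 to 10:00 (the 'sun' / 'morning' calendar).
cbh = pd.offsets.CustomBusinessHour(weekmask='Sun', start='06:00', end='10:00')

sunday_morning = pd.Timestamp('2019-03-03 07:00:00')  # Sunday, inside the window
sunday_early = pd.Timestamp('2019-03-03 05:00:00')    # Sunday, before the window
friday_morning = pd.Timestamp('2019-03-01 07:00:00')  # Friday, not in the week mask

on_sunday_morning = cbh.is_on_offset(sunday_morning)
on_sunday_early = cbh.is_on_offset(sunday_early)
on_friday_morning = cbh.is_on_offset(friday_morning)
print(on_sunday_morning, on_sunday_early, on_friday_morning)
```

Only timestamps that fall on an allowed day and inside the hour window count as being on the offset, which is what the filtering step relies on.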

Then define a function that computes all the required means for a given entity ID:

def getSums(entId):
    # Despite its name, this returns the mean of counts per calendar entry
    outRows = []
    wrk = df[df.entity_id.eq(entId)]    # Filter for entity Id
    for _, row in calendars.iterrows():
        dd = row.day_in_week
        hh = row.time_in_day
        cbh = row.CBH
        # Keep only the readings that fall inside the current calendar
        cnts = wrk[wrk.time.apply(lambda val: cbh.is_on_offset(val))]
        cnt = cnts.counts.mean()
        # Skip calendars with no readings (their mean is NaN)
        if pd.notnull(cnt):
            outRows.append(pd.Series([entId, dd, hh, cnt],
                index=['entity_id', 'day_in_week', 'time_in_day', 'counts_mean']))
    return pd.DataFrame(outRows)

As you can see, the result contains only non-null means.

To generate the result, run:

pd.concat([getSums(entId) for entId in df.entity_id.unique()], ignore_index=True)

For your data sample (which contains only morning readings), the result is:

   entity_id day_in_week time_in_day  counts_mean
0        175         sun     morning     6.333333
1        175     sun-thu     morning     6.333333
2        178         sun     morning     5.000000
3        178     sun-thu     morning     5.000000
4        200         sun     morning     5.000000
5        200     sun-thu     morning     5.000000
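
Since the question is about performance, here is a possible variant that avoids the per-row apply. It is not the method of this answer but a swapped-in technique: DataFrame.between_time selects the hour window (both ends inclusive by default) and the index's dayofweek selects the days, so the only Python loop runs over the handful of calendar definitions. The definitions list, its integer day codes (Monday=0 ... Sunday=6), and the sample rows are assumptions for the sketch.

```python
import pandas as pd

# Sunday rows from the question's sample, plus two Friday rows.
df = pd.DataFrame({
    "entity_id": [175, 175, 175, 175, 178, 178, 178, 178, 200, 200, 200, 175, 178],
    "time": pd.to_datetime([
        "2019-03-03 05:00", "2019-03-03 07:00", "2019-03-03 08:00", "2019-03-03 09:00",
        "2019-03-03 05:00", "2019-03-03 06:00", "2019-03-03 07:00", "2019-03-03 08:00",
        "2019-03-03 05:00", "2019-03-03 08:00", "2019-03-03 09:00",
        "2019-03-01 07:00", "2019-03-01 06:00",
    ]),
    "counts": [3, 6, 6, 7, 8, 4, 5, 6, 7, 3, 7, 6, 4],
})

# (day label, time label, pandas dayofweek codes, window start, window end)
definitions = [
    ("sun",     "morning", [6],             "06:00", "10:00"),
    ("sun-thu", "morning", [6, 0, 1, 2, 3], "06:00", "10:00"),
    ("fri-sat", "noon",    [4, 5],          "11:00", "13:00"),
]

indexed = df.set_index("time")  # between_time needs a DatetimeIndex
parts = []
for day_label, time_label, days, start, end in definitions:
    window = indexed.between_time(start, end)            # hour filter
    window = window[window.index.dayofweek.isin(days)]   # day filter
    if window.empty:
        continue  # no readings for this calendar
    means = (window.groupby("entity_id")["counts"].mean()
                   .reset_index(name="counts_mean"))
    means.insert(1, "day_in_week", day_label)
    means.insert(2, "time_in_day", time_label)
    parts.append(means)

result = pd.concat(parts, ignore_index=True)
print(result)
```

The rows come out grouped by calendar rather than by entity; a sort on entity_id would reproduce the ordering shown above.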

Python Pandas: resampling specific hours across different days and day ranges
