Better selection from a Pandas DatetimeIndex by weekday and time

Time: 2019-10-23 23:15:55

Tags: python pandas

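Problem: Select from a pandas DatetimeIndex by weekday and time. For example, I want to select all items between Tuesday 20:00 and Friday 06:00.

Question: Is there a better solution than my solution below?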

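I have an existing solution (see below), but I don't much like it, for the following reasons:

  • It converts timestamps to floats and compares floats, which is generally prone to precision problems.
  • Converting a rich data type to a primitive one through a contrived mapping looks neither elegant nor Pythonic.
  • Selections that span Sunday (weekday 6) to Monday (weekday 0) need special handling (not part of the example below).

My working example: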

import pandas as pd
from datetime import time
import calendar

# The DatetimeIndex to select from
idx = pd.date_range('2019-01-01', '2019-01-31', freq='H')

# Converts a datetime to a time-of-day fraction in [0, 1)
def datetime_to_time_frac(t):
    return t.hour / 24 + t.minute / (24 * 60) + t.second / (24 * 60 * 60)

# Converts a datetime to a float representing weekday (Monday: 0 to Sunday: 6) + time-of-day fraction in [0, 1)
def datetime_to_weekday_time_frac(t):
    return t.weekday + datetime_to_time_frac(t)

# DatetimeIndex converted to float
idx_conv = datetime_to_weekday_time_frac(idx)

# Boolean mask selecting items between Tuesday 20:00 and Friday 06:00
mask = (idx_conv >= calendar.TUESDAY + datetime_to_time_frac(time(20, 0)))\
     & (idx_conv <= calendar.FRIDAY + datetime_to_time_frac(time(6, 0)))

# Validation of mask in a pivot table
df = pd.DataFrame(index=idx[mask])
df['Date'] = df.index.date
df['Weekday'] = df.index.weekday
weekdays = list(calendar.day_abbr)
df['WeekdayName'] = df.Weekday.map(lambda x: weekdays[x])
df['Hour'] = df.index.hour
df.pivot_table(index=['Date', 'WeekdayName'], columns='Hour', values='Weekday', aggfunc='count')

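The final pivoted output shows that the code does the right thing, but I have the feeling there is a more elegant and more idiomatic way to solve this.

(The code is based on Python 3 with the latest Pandas.)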

Pivoted final output for code validation

2 answers:

Answer 0: (score: 0)

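It seems you can do this more cleanly with pandas' built-in indexing features. I avoid converting the times to fractional days, though to be clear, what I've done only works for whole hours. The basic difference is that it uses pandas built-ins and avoids importing calendar. Here's what I did; it's mostly equivalent to your very specific Tues-Fri example, but it can be adapted to a more general case if whole-hour intervals are all you need.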

import pandas as pd

idx = pd.date_range('2019-01-01', '2019-01-31', freq='H')
df = pd.DataFrame(index=idx)

# Build a series of filters for each part of your weekly interval.
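# Tuesday from 20:00 onward
tues = (df.index.weekday == 1) & (df.index.hour >= 20)
# All of Wednesday and Thursday
weds_thurs = df.index.weekday.isin([2,3])
# Friday up to and including the 06:00 hour (whole hours only)
fri = (df.index.weekday == 4) & (df.index.hour <= 6)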

# The mask is just the union of all those conditions
mask = tues | weds_thurs | fri

# now apply the mask and the rest is basically what you were doing
df = df.loc[mask]
df['Date'] = df.index.date
df['Weekday'] = df.index.weekday
df['WeekdayName'] = df.index.day_name()
df['Hour'] = df.index.hour
df.pivot_table(index=['Date', 'WeekdayName'], columns='Hour', values='Weekday', aggfunc='count')

Now I see output like the following: [image: pivot table output]
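
For data that is not aligned to whole hours, or for windows that wrap from Sunday into Monday, one option is to map each timestamp to an integer count of minutes since Monday 00:00 and compare with modular arithmetic. This is only a sketch under stated assumptions: `minutes_into_week` and `week_window_mask` are hypothetical helper names, not pandas functions, and the comparison works at minute resolution (seconds are ignored). The index is built with `freq='60min'`, which gives the same hourly spacing as in the question.

```python
import pandas as pd

MINUTES_PER_DAY = 24 * 60
MINUTES_PER_WEEK = 7 * MINUTES_PER_DAY


def minutes_into_week(idx):
    """Whole minutes since Monday 00:00 for each timestamp (integers only)."""
    return idx.weekday * MINUTES_PER_DAY + idx.hour * 60 + idx.minute


def week_window_mask(idx, start, end):
    """Boolean mask for a weekly window, both ends inclusive.

    start and end are (weekday, hour, minute) tuples with Monday == 0.
    A window whose end lies "before" its start wraps over the weekend.
    """
    start_min = start[0] * MINUTES_PER_DAY + start[1] * 60 + start[2]
    end_min = end[0] * MINUTES_PER_DAY + end[1] * 60 + end[2]
    # Length of the window in minutes, measured forward from start.
    width = (end_min - start_min) % MINUTES_PER_WEEK
    # Distance of each timestamp from start, also measured forward.
    offset = (minutes_into_week(idx) - start_min) % MINUTES_PER_WEEK
    return offset <= width


idx = pd.date_range('2019-01-01', '2019-01-31', freq='60min')

# Tuesday 20:00 through Friday 06:00
selected = idx[week_window_mask(idx, (1, 20, 0), (4, 6, 0))]

# Sunday 23:00 through Monday 00:30 (wraps from weekday 6 to weekday 0)
wrapped = idx[week_window_mask(idx, (6, 23, 0), (0, 0, 30))]

print(len(selected), len(wrapped))
```

Because everything stays in integers, there is no floating-point comparison, and the modulo takes care of the Sunday-to-Monday case without a separate branch.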

Answer 1: (score: 0)

The following should do what you need:

from datetime import datetime

def make_date_mask(day_start, time_start, day_end, time_end, series):
    flipped = False
    if day_start > day_end:
        # Need to flip the ordering, then negate at the end
        day_start, time_start, day_end, time_end = (
            day_end, time_end, day_start, time_start
        )
        flipped = True

    time_start = datetime.strptime(time_start, "%H:%M:%S").time()
    time_end = datetime.strptime(time_end, "%H:%M:%S").time()

    # Get everything for the specified days, inclusive
    mask = series.dt.dayofweek.between(day_start, day_end)
    # Filter things that happen before the beginning of the start time
    # of the start day
    mask = mask & ~(
        (series.dt.dayofweek == day_start) 
        & (series.dt.time < time_start)
    )
    # Filter things that happen after the ending time of the end day
    mask = mask & ~(
        (series.dt.dayofweek == day_end) 
        & (series.dt.time > time_end)
    )

    if flipped:
        # Negate the mask to get the actual result and add in the
        # times that were exactly on the boundaries, just in case
        mask = ~mask | (
            (series.dt.dayofweek == day_start) 
            & (series.dt.time == time_start)
        ) | (
            (series.dt.dayofweek == day_end) 
            & (series.dt.time == time_end)
        )
    return mask

Using it on your example:

import pandas as pd

df = pd.DataFrame({
    "dates": pd.date_range('2019-01-01', '2019-01-31', freq='H')
})
filtered_df = df[make_date_mask(6, "23:00:00", 0, "00:30:00", df["dates"])]

filtered_df looks like this:

                  dates
143 2019-01-06 23:00:00
144 2019-01-07 00:00:00
311 2019-01-13 23:00:00
312 2019-01-14 00:00:00
479 2019-01-20 23:00:00
480 2019-01-21 00:00:00
647 2019-01-27 23:00:00
648 2019-01-28 00:00:00
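
One way to double-check a listing like the one above without pandas is a brute-force pass over plain `datetime` objects, comparing `(weekday, time)` tuples lexicographically. This is a minimal sketch for cross-checking only; `in_week_window` is a hypothetical helper and not part of the answer's code or of any library.

```python
from datetime import datetime, time, timedelta


def in_week_window(ts, start, end):
    """True if ts falls inside the weekly window, both ends inclusive.

    start and end are (weekday, time) tuples with Monday == 0.
    """
    key = (ts.weekday(), ts.time())
    if start <= end:
        # Ordinary window inside one Monday-to-Sunday week
        return start <= key <= end
    # Window wraps from the end of the week into the start of the next
    return key >= start or key <= end


# Hourly timestamps from 2019-01-01 00:00 through 2019-01-31 00:00
first = datetime(2019, 1, 1)
hours = [first + timedelta(hours=h) for h in range(30 * 24 + 1)]

# Sunday 23:00 through Monday 00:30
matches = [
    ts for ts in hours
    if in_week_window(ts, (6, time(23, 0)), (0, time(0, 30)))
]

for ts in matches:
    print(ts)
```

On this hourly range it selects the same eight timestamps as the listing above, one Sunday 23:00 and one Monday 00:00 per week.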