Adding missing days of the week and times of day to a pandas DataFrame

Time: 2017-03-13 18:48:37

Tags: python python-2.7 pandas datetime

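I have a pandas DataFrame that looks like this: [image: Dataframe]

The time-of-day column has the levels: Early Morning, Morning, Afternoon, Evening, Late Night.

The goal is to make the data uniform by adding the missing days of the week and times of day between two consecutive observations. For example, if the current row is Wednesday, Early Morning and the next row is Thursday, Morning, I want to add: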

Wednesday Morning
Wednesday Afternoon
Wednesday Evening
Wednesday Late Night
Thursday Early Morning

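as the rows in between. What I have tried so far is to convert the day-of-week and time-of-day levels to numbers and then subtract them to get the number of days or time slots to add between them. [image: DataFrame2]

I would like to know whether there is a more efficient way to do this. Here is the code I have written: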

# Map each day name and time-of-day label to a number, row by row.
# (.loc replaces the deprecated .ix indexer.)
for i1, col1 in dfMod.iterrows():
    if col1['day'] == "MONDAY":
        dfMod.loc[i1, 'weekIndex'] = 1
    elif col1['day'] == "TUESDAY":
        dfMod.loc[i1, 'weekIndex'] = 2
    elif col1['day'] == "WEDNESDAY":
        dfMod.loc[i1, 'weekIndex'] = 3
    elif col1['day'] == "THURSDAY":
        dfMod.loc[i1, 'weekIndex'] = 4
    elif col1['day'] == "FRIDAY":
        dfMod.loc[i1, 'weekIndex'] = 5
    elif col1['day'] == "SATURDAY":
        dfMod.loc[i1, 'weekIndex'] = 6
    else:
        dfMod.loc[i1, 'weekIndex'] = 7

    if col1['timeType'] == "EARLY MORNING":
        dfMod.loc[i1, 'dayIndex'] = 1
    elif col1['timeType'] == "MORNING":
        dfMod.loc[i1, 'dayIndex'] = 2
    elif col1['timeType'] == "AFTERNOON":
        dfMod.loc[i1, 'dayIndex'] = 3
    elif col1['timeType'] == "EVENING":
        dfMod.loc[i1, 'dayIndex'] = 4
    else:
        dfMod.loc[i1, 'dayIndex'] = 5
dfMod = dfMod.reset_index(drop=True)
# Use item assignment: attribute assignment (dfMod.leadWeek = ...) does not create a column
dfMod['leadWeek'] = dfMod.groupby('adId')['weekIndex'].shift(-1)
dfMod['leadDay'] = dfMod.groupby('adId')['dayIndex'].shift(-1)
dfMod['diffWeek'] = dfMod['leadWeek'] - dfMod['weekIndex']
dfMod['diffDay'] = dfMod['leadDay'] - dfMod['dayIndex']
dfMod.head()

1 answer:

Answer 0: (score: 0)

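Here is one way to approach the problem. It is a class that inherits from datetime.datetime and provides a few methods for converting the strings to a datetime. The benefit of having a datetime is that you can use the many pandas methods that work with them. In the example, I use resample to expand your frame and fill it with the data you want.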
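To illustrate the resample idea in isolation, here is a minimal sketch on hypothetical toy data (the `obs` frame and its values are made up for illustration and are not taken from the question). It assumes each observation already sits on a 288-minute boundary, that is, one fifth of a day.

```python
import pandas as pd

# Hypothetical toy data: two observations on the same day, three slots apart.
obs = pd.DataFrame(
    {"ad_id": [1, 2]},
    index=pd.to_datetime(["2017-03-15 04:48", "2017-03-15 19:12"]),
)

# Upsample to five slots per day (288 minutes each) and
# forward-fill the newly created rows from the previous observation.
filled = obs.resample("288min").ffill()
print(filled)
```

The two slots between the observations (09:36 and 14:24) are created by the resample and inherit `ad_id` from the earlier row.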

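Another thing illustrated here is using a dict to translate one thing into another. This kind of structure is usually preferred over a stack of if/elif statements.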
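As a small sketch of the dict idea applied directly to the question's columns, the loop of if/elif branches could be replaced by `Series.map`. The column names `day`, `timeType`, `weekIndex` and `dayIndex` and the numbering come from the question; the two-row frame and the constant names are hypothetical. One difference to keep in mind: labels that are missing from the dict become NaN instead of falling through to the `else` value.

```python
import pandas as pd

# Lookup tables using the same numbering as the question's loop.
DOW_INDEX = {
    "MONDAY": 1, "TUESDAY": 2, "WEDNESDAY": 3, "THURSDAY": 4,
    "FRIDAY": 5, "SATURDAY": 6, "SUNDAY": 7,
}
TOD_INDEX = {
    "EARLY MORNING": 1, "MORNING": 2, "AFTERNOON": 3,
    "EVENING": 4, "LATE NIGHT": 5,
}

# Hypothetical two-row frame standing in for dfMod.
dfMod = pd.DataFrame({
    "day": ["WEDNESDAY", "THURSDAY"],
    "timeType": ["EARLY MORNING", "MORNING"],
})

# One vectorized lookup per column instead of a row-by-row loop.
dfMod["weekIndex"] = dfMod["day"].map(DOW_INDEX)
dfMod["dayIndex"] = dfMod["timeType"].map(TOD_INDEX)
print(dfMod)
```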

Code:

import datetime as dt

class penta_datetime(dt.datetime):
    """ class which cleaves a day into five pieces """
    pandas_period = '288min'

    part_day = dt.timedelta(minutes=(24 * 60 / 5))
    base_date = dt.datetime.combine(
        dt.date.today(), dt.datetime.min.time()) - dt.timedelta(
        (dt.date.today().weekday() + 1) % 7)

    day_index_offset = {
        "LATE NIGHT": part_day * 0,
        "EARLY MORNING": part_day * 1,
        "MORNING": part_day * 2,
        "AFTERNOON": part_day * 3,
        "EVENING": part_day * 4,
    }

    dow_index = {
        'MONDAY': 1,
        'TUESDAY': 2,
        'WEDNESDAY': 3,
        'THURSDAY': 4,
        'FRIDAY': 5,
        'SATURDAY': 6,
        'SUNDAY': 7,
    }

    @classmethod
    def to_week(cls, date):
        return cls.combine(date, dt.datetime.min.time()) - dt.timedelta(
            (date.weekday() + 1) % 7)

    @classmethod
    def from_strings(cls, dow, tod, week=None):
        if week is None:
            week = cls.now()
        return (cls.to_week(week) +
                dt.timedelta(days=cls.dow_index[dow.upper()]) +
                cls.day_index_offset[tod.upper()])

    @classmethod
    def from_datetime(cls, datetime):
        return cls.combine(datetime.date(), datetime.time())

    @property
    def phase_of_day(self):
        return self.day_offset_index[self.time()]

    @property
    def dow_string(self):
        return self.dow_strings[self.isoweekday()]

penta_datetime.day_offset_index = {
    (penta_datetime.base_date + v).time(): k
    for k, v in penta_datetime.day_index_offset.items()}
penta_datetime.dow_strings = {
    v: k for k, v in penta_datetime.dow_index.items()}

Test code:

import pandas as pd

df = pd.DataFrame([
    [1, 'WEDNESDAY', 'LATE NIGHT'],
    [1, 'WEDNESDAY', 'EARLY MORNING'],
    [2, 'WEDNESDAY', 'EVENING'],
    [3, 'SATURDAY',  'MORNING'],
    [2, 'SATURDAY',  'AFTERNOON'],
], columns=['ad_id', 'day_of_week', 'time_of_day'])
print(df)

def convert_to_datetime(row):
    return penta_datetime.from_strings(row.day_of_week, row.time_of_day)

# make a copy of the dataframe
ids = df.copy()

# convert the strings into a datetime
ids['ts'] = df.apply(convert_to_datetime, axis=1)

# set the timestamps as the index
ids.set_index(['ts'], inplace=True)

# resample to 5 times a day, and forward-fill the data into the holes
# (ffill() replaces pad(), which newer pandas versions no longer provide)
ids = ids.resample(penta_datetime.pandas_period).ffill().reset_index()

# (optional) convert the strings to match extended timestamps
ids['time_of_day'] = ids['ts'].apply(
    lambda ts: penta_datetime.from_datetime(ts).phase_of_day)
ids['day_of_week'] = ids['ts'].apply(
    lambda ts: penta_datetime.from_datetime(ts).dow_string)
print(ids)

Results:

   ad_id day_of_week    time_of_day
0      1   WEDNESDAY     LATE NIGHT
1      1   WEDNESDAY  EARLY MORNING
2      2   WEDNESDAY        EVENING
3      3    SATURDAY        MORNING
4      2    SATURDAY      AFTERNOON

                    ts  ad_id day_of_week    time_of_day
0  2017-03-15 00:00:00      1   WEDNESDAY     LATE NIGHT
1  2017-03-15 04:48:00      1   WEDNESDAY  EARLY MORNING
2  2017-03-15 09:36:00      1   WEDNESDAY        MORNING
3  2017-03-15 14:24:00      1   WEDNESDAY      AFTERNOON
4  2017-03-15 19:12:00      2   WEDNESDAY        EVENING
5  2017-03-16 00:00:00      2    THURSDAY     LATE NIGHT
6  2017-03-16 04:48:00      2    THURSDAY  EARLY MORNING
7  2017-03-16 09:36:00      2    THURSDAY        MORNING
8  2017-03-16 14:24:00      2    THURSDAY      AFTERNOON
9  2017-03-16 19:12:00      2    THURSDAY        EVENING
10 2017-03-17 00:00:00      2      FRIDAY     LATE NIGHT
11 2017-03-17 04:48:00      2      FRIDAY  EARLY MORNING
12 2017-03-17 09:36:00      2      FRIDAY        MORNING
13 2017-03-17 14:24:00      2      FRIDAY      AFTERNOON
14 2017-03-17 19:12:00      2      FRIDAY        EVENING
15 2017-03-18 00:00:00      2    SATURDAY     LATE NIGHT
16 2017-03-18 04:48:00      2    SATURDAY  EARLY MORNING
17 2017-03-18 09:36:00      3    SATURDAY        MORNING
18 2017-03-18 14:24:00      2    SATURDAY      AFTERNOON

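Notes:

You were not very clear about how changes across weeks should be handled, so this keeps things simple and leaves that as an exercise for the reader.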
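For what it is worth, one possible way to handle the week boundary without datetimes is sketched below. It is only a sketch under stated assumptions, not something the question confirms: each (day, time of day) pair is turned into a slot number from 0 to 34, and a slot that comes earlier than the previous one is assumed to belong to the following week. The time-of-day order follows the question (Early Morning first, Late Night last), and the names `slot_of` and `fill_between` are hypothetical helpers.

```python
import pandas as pd

DAYS = ["MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY",
        "FRIDAY", "SATURDAY", "SUNDAY"]
TIMES = ["EARLY MORNING", "MORNING", "AFTERNOON", "EVENING", "LATE NIGHT"]
SLOTS_PER_WEEK = len(DAYS) * len(TIMES)  # 35 slots in a week


def slot_of(day, tod):
    """Return the slot number (0..34) of a day / time-of-day pair."""
    return DAYS.index(day.upper()) * len(TIMES) + TIMES.index(tod.upper())


def fill_between(rows):
    """Insert the missing slots between consecutive (day, tod) observations.

    Assumption: a slot earlier than the previous one is in the next week;
    an identical slot is treated as a repeat, not as a full week later.
    """
    out = []
    prev = None
    for day, tod in rows:
        cur = slot_of(day, tod)
        if prev is not None:
            gap = (cur - prev) % SLOTS_PER_WEEK
            for step in range(1, gap):
                s = (prev + step) % SLOTS_PER_WEEK
                out.append((DAYS[s // len(TIMES)], TIMES[s % len(TIMES)]))
        out.append((DAYS[cur // len(TIMES)], TIMES[cur % len(TIMES)]))
        prev = cur
    return pd.DataFrame(out, columns=["day_of_week", "time_of_day"])


# The example from the question: Wednesday Early Morning -> Thursday Morning.
example = fill_between([("Wednesday", "Early Morning"),
                        ("Thursday", "Morning")])
print(example)
```

To apply this per ad, the same function could be called on each group of rows, at the cost of a Python-level loop over the observations.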