Python: Filter a DataFrame in Pandas by hour, day and month, grouped by year

Time: 2016-10-18 20:41:52

Tags: python datetime pandas dataframe

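I am new to pandas and had to dig a lot to find a solution to this problem. I would like to know a better way to solve it, considering that I still need to fix the border issues.

I have a set of 10-minute measurements of "Power" from 2009 to 2012, and I want to select an hour window and a day/month window across all years (i.e. filter by hour, day and month, grouped by year).

What I have so far is the following: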

import pandas as pd
import numpy as np
import datetime

dates = pd.date_range(start="08/01/2009",end="08/01/2012",freq="10min")
df = pd.DataFrame(np.random.rand(len(dates), 1)*1500, index=dates, columns=['Power'])

def filter(df, day, month, hour, daysWindow, hoursWindow):
    """
    Filter a Dataframe by a date window and hour window grouped by years

    @type df: DataFrame
    @param df: DataFrame with dates and values

    @type day: int
    @param day: Day to focus on

    @type month: int
    @param month: Month to focus on

    @type hour: int
    @param hour: Hour to focus on

    @type daysWindow: int
    @param daysWindow: Number of days to perform the days window selection

    @type hoursWindow: int
    @param hoursWindow: Number of hours to perform the hours window selection

    @rtype: DataFrame
    @return: Returns a DataFrame with the rows that fall inside both windows
    """
    df_filtered = None
    grouped = df.groupby(lambda x : x.year)
    for year, groupYear in grouped:
        groupedMonthDay = groupYear.groupby(lambda x : (x.month, x.day))
        for monthDay, groupMonthDay in groupedMonthDay:
            if monthDay >= (month,day - daysWindow) and monthDay <= (month,day + daysWindow):
                new_df = groupMonthDay.ix[groupMonthDay.index.indexer_between_time(datetime.time(hour - hoursWindow), datetime.time(hour + hoursWindow))]
                if df_filtered is None:
                    df_filtered = new_df
                else:
                    df_filtered = df_filtered.append(new_df)
    return df_filtered

df_filtered = filter(df,day=8, month=10, hour=8, daysWindow=1, hoursWindow=1)
print(len(df))
print(len(df_filtered))

Which returns as output:

>>> 
157825
117
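
As a sanity check, the 117 rows can be reproduced by hand. This is only a sketch, assuming the sample data above (only 2009, 2010 and 2011 contain October data, since the range ends on 2012-08-01) and assuming both ends of the hour window are included, which is the default behaviour of `indexer_between_time`. The variable names are made up for illustration.

```python
# Expected row count for day=8, month=10, hour=8, daysWindow=1, hoursWindow=1
years_with_october = 3               # 2009, 2010, 2011 (data ends 2012-08-01)
days_per_year = 2 * 1 + 1            # Oct 7, 8 and 9
samples_per_day = 2 * 60 // 10 + 1   # 07:00..09:00 inclusive, every 10 minutes

expected_rows = years_with_october * days_per_year * samples_per_day
print(expected_rows)
```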

This code needs to be improved for border issues, for example when selecting hour 1 with an hours window of 2, i.e.:

>>> filter(df,day=8, month=10, hour=1, daysWindow=1, hoursWindow=2)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "D:\tmp\test_filtro.py", line 40, in filter
    new_df = groupMonthDay.ix[groupMonthDay.index.indexer_between_time(datetime.time(hour - hoursWindow), datetime.time(hour + hoursWindow))]
ValueError: hour must be in 0..23
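
The error comes from `datetime.time`, which only accepts hours from 0 to 23; here `hour - hoursWindow` is -1. A minimal sketch of two possible ways to handle it: clamping to the valid range, or wrapping around midnight with a modulo. `clamp_hour` is a hypothetical helper, not part of the original code, and which of the two behaviours is wanted is an assumption the question does not settle.

```python
import datetime

def clamp_hour(hour):
    """Clamp an hour value into the valid 0..23 range (hypothetical helper)."""
    return max(0, min(23, hour))

hour, hours_window = 1, 2

try:
    datetime.time(hour - hours_window)      # same as datetime.time(-1)
except ValueError as exc:
    print("invalid hour:", exc)

# Option 1: clamp, so the window is cut off at midnight
start = datetime.time(clamp_hour(hour - hours_window))
end = datetime.time(clamp_hour(hour + hours_window))
print(start, end)

# Option 2: wrap, so the window starts on the previous evening
wrapped_start_hour = (hour - hours_window) % 24
print(wrapped_start_hour)
```

With wrapping, the start of the window belongs to the previous day, so the day window has to be adjusted as well; clamping avoids that at the cost of a shorter window.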

A similar problem occurs when selecting days like 1 or 30.
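
For the day border, the `(month, day)` tuple comparison cannot cross a month boundary: with `day=1` and `daysWindow=1` the lower bound becomes `(10, 0)`, which matches no real day, so September 30 is never selected. A small sketch of the difference, assuming plain `datetime.date` arithmetic is acceptable; the year 2010 is picked only for illustration.

```python
import datetime

day, month, days_window = 1, 10, 1
year = 2010  # arbitrary year, chosen only for this illustration

# Tuple bounds as used in the loop above: day 0 does not exist
lower_tuple = (month, day - days_window)
print(lower_tuple)

# Date arithmetic rolls over month (and year) borders correctly
center = datetime.date(year, month, day)
start = center - datetime.timedelta(days=days_window)
end = center + datetime.timedelta(days=days_window)
print(start, end)
```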

How can this code be improved?

1 Answer:

Answer 0: (score: 0)

The updated filter function below avoids the border issues:

import pandas as pd
import numpy as np
import datetime

dates = pd.date_range(start="08/01/2009",end="08/01/2012",freq="10min")
df = pd.DataFrame(np.random.rand(len(dates), 1)*1500, index=dates, columns=['Power'])

def filter(df, day, month, hour, minute=0, daysWindow=1, hoursWindow=1):
    """
    Filter a Dataframe by a date window and hour window grouped by years

    @type df: DataFrame
    @param df: DataFrame with dates and values

    @type day: int
    @param day: Day to focus on

    @type month: int
    @param month: Month to focus on

    @type hour: int
    @param hour: Hour to focus on

    @type minute: int
    @param minute: Minute used for both ends of the hours window

    @type daysWindow: int
    @param daysWindow: Number of days to perform the days window selection

    @type hoursWindow: int
    @param hoursWindow: Number of hours to perform the hours window selection

    @rtype: DataFrame
    @return: Returns a DataFrame with the rows that fall inside both windows
    """
    frames = []
    # Group by year so the same day/month window is applied to every year
    grouped = df.groupby(lambda x: x.year)
    for year, groupYear in grouped:
        date = datetime.date(year, month, day)
        # Date arithmetic handles month and year borders (e.g. day=1 or day=30)
        dateStart = date - datetime.timedelta(days=daysWindow)
        dateEnd = date + datetime.timedelta(days=daysWindow + 1)
        # Label slicing is inclusive, so the 00:00 sample of dateEnd is kept too
        df_filtered_days = df.loc[pd.Timestamp(dateStart):pd.Timestamp(dateEnd)]
        # Clamp the hour window to 0..23 so datetime.time() cannot raise
        timeStart = datetime.time(max(hour - hoursWindow, 0), minute)
        timeEnd = datetime.time(min(hour + hoursWindow, 23), minute)
        # .iloc replaces .ix, which was removed in pandas 1.0
        new_df = df_filtered_days.iloc[df_filtered_days.index.indexer_between_time(timeStart, timeEnd)]
        if not new_df.empty:
            frames.append(new_df)
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    return pd.concat(frames) if frames else None

df_filtered = filter(df, day=8, month=10, hour=1, daysWindow=1, hoursWindow=2)
print(len(df))
print(len(df_filtered))

Output is:

>>> 
157825
174
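
An alternative, sketched under assumptions the answer does not make: instead of clamping the hour to 0..23, each window can be built from absolute timestamps so that it wraps across midnight and across month borders on its own. `filter_window`, `days_window` and `hours_window` are hypothetical names, and the sketch assumes a sorted `DatetimeIndex`. Because the window wraps, the result differs from the clamped version: for hour 1 with a 2-hour window, rows from 23:00 of the previous evening are included.

```python
import numpy as np
import pandas as pd

def filter_window(df, day, month, hour, days_window=1, hours_window=1):
    """Hypothetical helper: rows within +/- hours_window of `hour` on each
    day within +/- days_window of day/month, repeated for every year."""
    pieces = []
    for year in df.index.year.unique():
        # Raises ValueError if the date does not exist in that year (e.g. Feb 29)
        center = pd.Timestamp(year=int(year), month=month, day=day, hour=hour)
        for offset in range(-days_window, days_window + 1):
            mid = center + pd.Timedelta(days=offset)
            lo = mid - pd.Timedelta(hours=hours_window)
            hi = mid + pd.Timedelta(hours=hours_window)
            piece = df.loc[lo:hi]          # label slice, both ends inclusive
            if not piece.empty:
                pieces.append(piece)
    if not pieces:
        return df.iloc[0:0]                # empty frame with the same columns
    out = pd.concat(pieces)
    # Windows of 12 hours or more overlap, so drop repeated timestamps
    return out[~out.index.duplicated()]

dates = pd.date_range(start="08/01/2009", end="08/01/2012", freq="10min")
df = pd.DataFrame(np.random.rand(len(dates), 1) * 1500, index=dates, columns=["Power"])

wrapped = filter_window(df, day=8, month=10, hour=1, days_window=1, hours_window=2)
print(len(wrapped))
```

Looping over years and days keeps the logic close to the answer above; for this data set that is only a handful of slices, so the cost is small.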