Pandas: minimum over a time series within a filtered DataFrame

Time: 2020-06-25 22:33:42

Tags: python pandas

I am new to pandas and I am trying to solve the following problem.

I have a dataset (originally about 2 million rows):

test = pd.DataFrame({ 'Date' : ['2020-04-01','2020-04-02','2020-04-03',
                         '2020-04-04','2020-04-05','2020-04-06',
                         '2020-04-01','2020-04-02','2020-04-03'],
                  'Set' : ['Set1','Set1','Set1','Set1','Set1','Set1',
                         'Set2','Set2','Set2'],
                  'Type': ['Type1','Type1','Type1','Type1','Type1','Type1',
                         'Type1','Type1','Type1'],
                  'Category': ['Category1','Category1','Category1','Category1','Category1','Category1',
                         'Category2','Category2','Category2'],
                  'Rooms' : [6,5,4,7,2,9,3,5,1]
            })

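I need to create a new column in which each row contains a list of 21 values.

Each value in the list is the minimum number of rooms for a given (Set, Type, Category, Date) combination, taken from the date itself through the following 20 days.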

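For example, the first row contains [2020-04-01, Set1, Type1, Category1, 6]. I need to go through the whole df and find all rows with the same [Set1, Type1, Category1]. Within that filtered part, I need the minimum number of rooms over: 2020-04-01 + 0 days, 2020-04-01 + 1 day, 2020-04-01 + 2 days, ..., 2020-04-01 + 20 days.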
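
To make the target concrete, here is a minimal sketch for the first row only. It assumes that "the minimum over day 0 through day n" means a running minimum over the filtered rows ordered by date; the names `row`, `window` and `running_min` are made up for illustration.

```python
import pandas as pd

test = pd.DataFrame({
    'Date': ['2020-04-01', '2020-04-02', '2020-04-03',
             '2020-04-04', '2020-04-05', '2020-04-06',
             '2020-04-01', '2020-04-02', '2020-04-03'],
    'Set': ['Set1'] * 6 + ['Set2'] * 3,
    'Type': ['Type1'] * 9,
    'Category': ['Category1'] * 6 + ['Category2'] * 3,
    'Rooms': [6, 5, 4, 7, 2, 9, 3, 5, 1],
})
test['Date'] = pd.to_datetime(test['Date'])

row = test.iloc[0]

# Rows with the same Set/Type/Category, from the row's date up to 20 days later.
window = test[
    (test['Set'] == row['Set'])
    & (test['Type'] == row['Type'])
    & (test['Category'] == row['Category'])
    & (test['Date'] >= row['Date'])
    & (test['Date'] <= row['Date'] + pd.Timedelta(days=20))
].sort_values('Date')

# Running minimum: entry n is the minimum over day 0 through day n.
running_min = window['Rooms'].cummin().tolist()
print(running_min)  # [6, 5, 4, 4, 2, 2]
```

Only six values come out here because Set1 has six dates in the sample; how to fill the remaining slots of the 21-value list is a separate choice.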

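I came up with this code. It works, but only on a small part of the df. When I try to run it on the whole df, it takes forever. I am sure it can be optimized with groupby, but I still can't get groupby to work properly.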

for i in range(len(test)):  #iterate through entire dataframe
    x=[]                    #create a list that will be added to the new column at the end
    current_room = test.loc[    #filter out needed part of the dataframe
            (test.Set== test.Set.loc[i]) &
            (test.Type == test.Type.loc[i]) &
            (test.Category == test.Category.loc[i]) &
            (test.Date >= test.loc[i,'Date']) &
            (test.Date <= test.loc[i,'Date']+pd.Timedelta(days=20))
            ]
    for n in range(1,22):         #run internal loop to fill out the x list by additionally filtering 
        if len(current_set)>=n:  #check if there is enough days after current date
            c_date = test.Date < test.Date.loc[i]+pd.Timedelta(days=n) #if True filter current_set to needed state
            x.append(min(current_room.loc[c_date,'Rooms']))
        else:
            x.append(0)

    test.at[i,'Min_Rooms']=x #add generated list to new column 

The full code, runnable from a single cell:

import pandas as pd

test = pd.DataFrame({ 'Date' : ['2020-04-01','2020-04-02','2020-04-03',
                         '2020-04-04','2020-04-05','2020-04-06',
                         '2020-04-01','2020-04-02','2020-04-03'],
                  'Set' : ['Set1','Set1','Set1','Set1','Set1','Set1',
                         'Set2','Set2','Set2'],
                  'Type': ['Type1','Type1','Type1','Type1','Type1','Type1',
                         'Type1','Type1','Type1'],
                  'Category': ['Category1','Category1','Category1','Category1','Category1','Category1',
                         'Category2','Category2','Category2'],
                  'Rooms' : [6,5,4,7,2,9,3,5,1]
            })

# Convert 'Date' to datetime (the strings use dashes, so the format must too)
test['Date'] = pd.to_datetime(test['Date'], format= '%Y-%m-%d')

# Create new column
test.at[0,'Min_Rooms'] = 1

#Convert to object type in order to insert lists
test['Min_Rooms'] = test['Min_Rooms'].astype(object)

# Start the loop
for i in range(len(test)):
    x=[]
    current_room = test.loc[
            (test.Set == test.Set.loc[i]) &
            (test.Type == test.Type.loc[i]) &
            (test.Category == test.Category.loc[i]) &
            (test.Date >= test.loc[i,'Date']) &
            (test.Date <= test.loc[i,'Date']+pd.Timedelta(days=20))
            ]
    for n in range(1,22):
        if len(current_room)>=n:
            c_date = test.Date < test.Date.loc[i]+pd.Timedelta(days=n)
            x.append(min(current_room.loc[c_date,'Rooms']))
        else:
            x.append(0)

    test.at[i,'Min_Rooms']=x

print(test.to_markdown())

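I reduced the number of days from 21 to 7, and this is my result. As you can see in the first row, each value in the Min_Rooms list is the minimum number of rooms for the same Set + Type + Category combination over 2020-04-01 + 0 days, + 1 day, ..., + 6 days. That makes 7 days in total, including the start date itself. But because (Set1 | Type1 | Category1) has only 6 days available in the DataFrame, from 01/APR to 06/APR, the last value in the list is 0. For the second row the window starts on 02/APR, and since only 5 days remain in the df from that date, the last two values are 0, and so on.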

|    | Date       | Set   | Type   | Category   |   Rooms | Min_Rooms             |
|---:|:-----------|:------|:-------|:-----------|--------:|:----------------------|
|  0 | 2020-04-01 | Set1  | Type1  | Category1  |       6 | [6, 5, 4, 4, 2, 2, 0] |
|  1 | 2020-04-02 | Set1  | Type1  | Category1  |       5 | [5, 4, 4, 2, 2, 0, 0] |
|  2 | 2020-04-03 | Set1  | Type1  | Category1  |       4 | [4, 4, 2, 2, 0, 0, 0] |
|  3 | 2020-04-04 | Set1  | Type1  | Category1  |       7 | [7, 2, 2, 0, 0, 0, 0] |
|  4 | 2020-04-05 | Set1  | Type1  | Category1  |       2 | [2, 2, 0, 0, 0, 0, 0] |
|  5 | 2020-04-06 | Set1  | Type1  | Category1  |       9 | [9, 0, 0, 0, 0, 0, 0] |
|  6 | 2020-04-01 | Set2  | Type1  | Category2  |       3 | [3, 3, 1, 0, 0, 0, 0] |
|  7 | 2020-04-02 | Set2  | Type1  | Category2  |       5 | [5, 1, 0, 0, 0, 0, 0] |
|  8 | 2020-04-03 | Set2  | Type1  | Category2  |       1 | [1, 0, 0, 0, 0, 0, 0] |

1 answer:

Answer 0 (score: 0)

Here is one way to do what you are looking for.

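  • I use rolling to build a list of the rooms available over the past N days, then use shift to move it back so that each date looks at future availability instead.
  • For convenience I use NUM_DAYS = 7. For clarity I also split the work into several steps.
  • To compute availability for future dates, the first thing I do is add dummy dates with zero availability to the DataFrame. I drop these dummy dates at the end.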
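
The rolling-then-shift idea from the first bullet can be seen in isolation on a single series. This is only a sketch with a plain row-based window of 3 and a scalar minimum, not the answer's per-group, list-building version; the names `N`, `past_min` and `future_min` are illustrative.

```python
import pandas as pd

N = 3  # illustrative window length
rooms = pd.Series([6, 5, 4, 7, 2, 9])

# Backward-looking: minimum over the current row and up to N-1 previous rows.
past_min = rooms.rolling(N, min_periods=1).min()

# Shift back by N-1 rows so each row sees the window that starts at itself.
future_min = past_min.shift(-(N - 1))

print(past_min.tolist())    # [6.0, 5.0, 4.0, 4.0, 2.0, 2.0]
print(future_min.tolist())  # [4.0, 4.0, 2.0, 2.0, nan, nan]
```

The last N-1 rows end up as NaN because no later window exists for them, which is why the answer first appends dummy future dates. The answer's full code applies the same idea per group and collects lists instead of a single minimum:
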
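    import datetime

    import pandas as pd

    NUM_DAYS = 7
    
    # df is assumed to have the question's columns (Date, Set, Type, Category, Rooms).
    df["Date"] = pd.to_datetime(df.Date)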
    
    #New records is a set of future dummy dates with 0 availability. 
    new_records = df.reset_index().groupby(["Set", "Type", "Category"], as_index = False)["Date"].max()
    
    new_records["Date"] = new_records.Date.apply(lambda d: pd.date_range(d + datetime.timedelta(days = 1), 
                                                                         periods=NUM_DAYS, freq="1D"))
    new_records = new_records.explode("Date")
    new_records["dummy"] = True
    new_records["Rooms"] = 0
    
    # expand the data to include these dummy dates. 
    df = pd.concat([df, new_records], axis=0)
    
    # calculate the minimum availability on any given date. 
    df["min_per_day"] = df.groupby(["Set", "Type", "Category", "Date"])["Rooms"].transform("min")
    
    df.set_index("Date", inplace = True)
        
    list_of_lists=[]
    def add_to_lists(s):
        list_of_lists.append(list(s))
        return 0
        
    # Build rolling lists of N days. Since 'rolling' can't work with 
    # lists, use 'list_of_lists' to maintain the results, and then add 
    # them as a new column. 
    df.groupby(["Set", "Type", "Category"]).rolling(str(NUM_DAYS) + "D", 
                                                    min_periods = 1)["min_per_day"].apply(add_to_lists, raw=True)
    df.sort_values(["Set", "Type", "Category"], inplace = True)
    df["past_mins"] = list_of_lists
    
    # Shift the results by N days - now, every date looks at the 
    # future rather than at the past. 
    df["future_mins"] = df.groupby(["Set", "Type", "Category"])["past_mins"].shift((-(NUM_DAYS-1)))
    
    # Drop dummy dates and irrelevant data. 
    df = df[df.dummy.isna()]
    df.drop(["dummy", "min_per_day", "past_mins"], axis=1, inplace=True)

The result is:

             Set   Type   Category  Rooms                          future_mins
Date                                                                          
2020-04-01  Set1  Type1  Category1      6  [6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0]
2020-04-02  Set1  Type1  Category1      5  [5.0, 4.0, 3.0, 2.0, 1.0, 0.0, 0.0]
2020-04-03  Set1  Type1  Category1      4  [4.0, 3.0, 2.0, 1.0, 0.0, 0.0, 0.0]
2020-04-04  Set1  Type1  Category1      3  [3.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
2020-04-05  Set1  Type1  Category1      2  [2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
2020-04-06  Set1  Type1  Category1      1  [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
2020-04-01  Set2  Type1  Category2      3  [3.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
2020-04-02  Set2  Type1  Category2      2  [2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
2020-04-03  Set2  Type1  Category2      1  [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
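
For comparison, here is a sketch of a different technique that reproduces the running minima and 0 padding shown in the question's table. It is not the answer's method: it slides a NumPy window over each group. It assumes one row per day with no gaps inside each (Set, Type, Category) group and NumPy 1.20 or later for `sliding_window_view`. The helper name `forward_running_min` is made up for illustration.

```python
import numpy as np
import pandas as pd

NUM_DAYS = 7  # the question's final target is 21
KEYS = ['Set', 'Type', 'Category']

test = pd.DataFrame({
    'Date': ['2020-04-01', '2020-04-02', '2020-04-03',
             '2020-04-04', '2020-04-05', '2020-04-06',
             '2020-04-01', '2020-04-02', '2020-04-03'],
    'Set': ['Set1'] * 6 + ['Set2'] * 3,
    'Type': ['Type1'] * 9,
    'Category': ['Category1'] * 6 + ['Category2'] * 3,
    'Rooms': [6, 5, 4, 7, 2, 9, 3, 5, 1],
})
test['Date'] = pd.to_datetime(test['Date'])


def forward_running_min(rooms: pd.Series) -> pd.Series:
    """Per row: running minimum over the next NUM_DAYS rows, padded with 0."""
    values = rooms.to_numpy(dtype=float)
    # Pad with NaN so every row has a full-length window.
    padded = np.concatenate([values, np.full(NUM_DAYS - 1, np.nan)])
    # window[i, k] is the value k rows after row i (NaN past the group's end).
    window = np.lib.stride_tricks.sliding_window_view(padded, NUM_DAYS)
    # np.minimum propagates NaN, so slots past the end stay NaN ...
    running = np.minimum.accumulate(window, axis=1)
    # ... and are then replaced by 0, as in the question's output.
    running = np.nan_to_num(running, nan=0.0).astype(int)
    return pd.Series(running.tolist(), index=rooms.index)


# Sort by date within each group, then process each group once.
ordered = test.sort_values(KEYS + ['Date'])
parts = [forward_running_min(rooms)
         for _, rooms in ordered.groupby(KEYS, sort=False)['Rooms']]

# Assignment aligns on the original index, so row order in `test` is kept.
test['Min_Rooms'] = pd.concat(parts)
print(test)
```

Each group is handled with a handful of array operations instead of one filter per row, which is what makes the row-by-row loop slow on large data. If dates can be missing inside a group, the group would need to be reindexed to a daily calendar first.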