I'm new to pandas and I'm trying to solve the following problem.
I have a dataset (originally about 2 million rows):
test = pd.DataFrame({'Date': ['2020-04-01', '2020-04-02', '2020-04-03',
                              '2020-04-04', '2020-04-05', '2020-04-06',
                              '2020-04-01', '2020-04-02', '2020-04-03'],
                     'Set': ['Set1', 'Set1', 'Set1', 'Set1', 'Set1', 'Set1',
                             'Set2', 'Set2', 'Set2'],
                     'Type': ['Type1', 'Type1', 'Type1', 'Type1', 'Type1', 'Type1',
                              'Type1', 'Type1', 'Type1'],
                     'Category': ['Category1', 'Category1', 'Category1', 'Category1', 'Category1', 'Category1',
                                  'Category2', 'Category2', 'Category2'],
                     'Rooms': [6, 5, 4, 7, 2, 9, 3, 5, 1]
                     })
I need to create a new column where each row will contain a list of 21 values.
Each value in the list corresponds to the minimum number of rooms for each (Set - Type - Category - Date) combination over the next 1 to 20 days.
For example, the first row contains [2020-04-01, Set1, Type1, Category1, 6]. I need to go through the whole df, find all rows that contain the same [Set1, Type1, Category1], and within that filtered part find the minimum of Rooms for: 2020-04-01 + 0 days, 2020-04-01 + 1 day, 2020-04-01 + 2 days ... 2020-04-01 + 20 days.
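To make this concrete, here is a tiny standalone sketch of the 21-value list I expect for the first row (illustrative only; it uses the Set1/Type1/Category1 rooms from the sample above and pads with 0 once the group runs out of dates, which is also what my code below does):

import pandas as pd

# Rooms of the Set1/Type1/Category1 group from the sample above, indexed by date
rooms = pd.Series([6, 5, 4, 7, 2, 9],
                  index=pd.date_range('2020-04-01', periods=6))

start = pd.Timestamp('2020-04-01')
available = rooms.loc[start:]                    # dates on or after the start date
mins = []
for k in range(21):                              # offsets +0 ... +20 days
    if k < len(available):                       # enough days after the start date
        mins.append(int(available.loc[:start + pd.Timedelta(days=k)].min()))
    else:
        mins.append(0)                           # pad with 0 once the data runs out

print(mins)  # 21 values: [6, 5, 4, 4, 2, 2, 0, 0, ..., 0]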
I came up with the code below, and it works, but only on a small part of the df. When I try to run it on the whole df it takes forever. I'm sure it can be optimized with groupby, but I still can't get that to work.
for i in range(len(test)):  # iterate through the entire dataframe
    x = []  # create a list that will be added to the new column at the end
    current_room = test.loc[  # filter out the needed part of the dataframe
        (test.Set == test.Set.loc[i]) &
        (test.Type == test.Type.loc[i]) &
        (test.Category == test.Category.loc[i]) &
        (test.Date >= test.loc[i, 'Date']) &
        (test.Date <= test.loc[i, 'Date'] + pd.Timedelta(days=20))
    ]
    for n in range(1, 22):  # run an inner loop that fills the x list by additional filtering
        if len(current_room) >= n:  # check if there are enough days after the current date
            c_date = test.Date < test.Date.loc[i] + pd.Timedelta(days=n)  # if True, filter current_room to the needed state
            x.append(min(current_room.loc[c_date, 'Rooms']))
        else:
            x.append(0)
    test.at[i, 'Min_Rooms'] = x  # add the generated list to the new column
The whole code runs from a single cell:
import pandas as pd

test = pd.DataFrame({'Date': ['2020-04-01', '2020-04-02', '2020-04-03',
                              '2020-04-04', '2020-04-05', '2020-04-06',
                              '2020-04-01', '2020-04-02', '2020-04-03'],
                     'Set': ['Set1', 'Set1', 'Set1', 'Set1', 'Set1', 'Set1',
                             'Set2', 'Set2', 'Set2'],
                     'Type': ['Type1', 'Type1', 'Type1', 'Type1', 'Type1', 'Type1',
                              'Type1', 'Type1', 'Type1'],
                     'Category': ['Category1', 'Category1', 'Category1', 'Category1', 'Category1', 'Category1',
                                  'Category2', 'Category2', 'Category2'],
                     'Rooms': [6, 5, 4, 7, 2, 9, 3, 5, 1]
                     })
# Convert 'Date' to datetime
test['Date'] = pd.to_datetime(test['Date'], format='%Y-%m-%d')
# Create new column
test.at[0,'Min_Rooms'] = 1
#Convert to object type in order to insert lists
test['Min_Rooms'] = test['Min_Rooms'].astype(object)
# Start the loop
for i in range(len(test)):
    x = []
    current_room = test.loc[
        (test.Set == test.Set.loc[i]) &
        (test.Type == test.Type.loc[i]) &
        (test.Category == test.Category.loc[i]) &
        (test.Date >= test.loc[i, 'Date']) &
        (test.Date <= test.loc[i, 'Date'] + pd.Timedelta(days=20))
    ]
    for n in range(1, 22):
        if len(current_room) >= n:
            c_date = test.Date < test.Date.loc[i] + pd.Timedelta(days=n)
            x.append(min(current_room.loc[c_date, 'Rooms']))
        else:
            x.append(0)
    test.at[i, 'Min_Rooms'] = x
print(test.to_markdown())
I reduced the number of days from 21 to 7, and here is my result. As you can see in the first row, each value in the Min_Rooms list is the minimum of Rooms for the same Set + Type + Category combination over 2020-04-01 + 0 days, + 1 day ... + 6 days, i.e. 7 days in total, including the start date itself. But since (Set1 | Type1 | Category1) is only available for 6 days in the dataframe, from 01/Apr to 06/Apr, the last value in the list is 0. For the second row the values start from 02/Apr, and since there are only 5 days in the df from that date, the last two values are 0, and so on.
| | Date | Set | Type | Category | Rooms | Min_Rooms |
|---:|:-----------|:------|:-------|:-----------|--------:|:----------------------|
| 0 | 2020-04-01 | Set1 | Type1 | Category1 | 6 | [6, 5, 4, 4, 2, 2, 0] |
| 1 | 2020-04-02 | Set1 | Type1 | Category1 | 5 | [5, 4, 4, 2, 2, 0, 0] |
| 2 | 2020-04-03 | Set1 | Type1 | Category1 | 4 | [4, 4, 2, 2, 0, 0, 0] |
| 3 | 2020-04-04 | Set1 | Type1 | Category1 | 7 | [7, 2, 2, 0, 0, 0, 0] |
| 4 | 2020-04-05 | Set1 | Type1 | Category1 | 2 | [2, 2, 0, 0, 0, 0, 0] |
| 5 | 2020-04-06 | Set1 | Type1 | Category1 | 9 | [9, 0, 0, 0, 0, 0, 0] |
| 6 | 2020-04-01 | Set2 | Type1 | Category2 | 3 | [3, 3, 1, 0, 0, 0, 0] |
| 7 | 2020-04-02 | Set2 | Type1 | Category2 | 5 | [5, 1, 0, 0, 0, 0, 0] |
| 8 | 2020-04-03 | Set2 | Type1 | Category2 | 1 | [1, 0, 0, 0, 0, 0, 0] |
Answer (score: 0)
Here is one way of doing what you are looking for. Use `rolling` to create a list of the rooms available over the past N days, and then use `shift` to move it back so that every date looks at future availability instead of the past.

import datetime
import pandas as pd

NUM_DAYS = 7
df["Date"] = pd.to_datetime(df.Date)
#New records is a set of future dummy dates with 0 availability.
new_records = df.reset_index().groupby(["Set", "Type", "Category"], as_index = False)["Date"].max()
new_records["Date"] = new_records.Date.apply(lambda d: pd.date_range(d + datetime.timedelta(days = 1),
periods=NUM_DAYS, freq="1D"))
new_records = new_records.explode("Date")
new_records["dummy"] = True
new_records["Rooms"] = 0
# expand the data to include these dummy dates.
df = pd.concat([df, new_records], axis=0)
# calculate the minimum availability on any given date.
df["min_per_day"] = df.groupby(["Set", "Type", "Category", "Date"])["Rooms"].transform(min)
df.set_index("Date", inplace = True)
list_of_lists=[]
def add_to_lists(s):
    list_of_lists.append(list(s))
    return 0
# Build rolling lists of N days. Since 'rolling' can't work with
# lists, use 'list_of_lists' to maintain the results, and then add
# them as a new column.
df.groupby(["Set", "Type", "Category"]).rolling(str(NUM_DAYS) + "D",
min_period = 1)["min_per_day"].apply(add_to_lists, raw=True)
df.sort_values(["Set", "Type", "Category"], inplace = True)
df["past_mins"] = list_of_lists
# Shift the results by N days - now, every date looks at the
# future rather than at the past.
df["future_mins"] = df.groupby(["Set", "Type", "Category"])["past_mins"].shift((-(NUM_DAYS-1)))
# Drop dummy dates and irrelevant data.
df = df[df.dummy.isna()]
df.drop(["dummy", "min_per_day", "past_mins"], axis=1, inplace=True)
The result is:
Set Type Category Rooms future_mins
Date
2020-04-01 Set1 Type1 Category1 6 [6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0]
2020-04-02 Set1 Type1 Category1 5 [5.0, 4.0, 3.0, 2.0, 1.0, 0.0, 0.0]
2020-04-03 Set1 Type1 Category1 4 [4.0, 3.0, 2.0, 1.0, 0.0, 0.0, 0.0]
2020-04-04 Set1 Type1 Category1 3 [3.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
2020-04-05 Set1 Type1 Category1 2 [2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
2020-04-06 Set1 Type1 Category1 1 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
2020-04-01 Set2 Type1 Category2 3 [3.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
2020-04-02 Set2 Type1 Category2 2 [2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
2020-04-03 Set2 Type1 Category2 1 [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
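One follow-up note: the lists in future_mins hold the per-day minima inside each window, while the question's Min_Rooms lists hold running minima over the window. If the running minimum is what is needed, it can be derived from future_mins afterwards; a minimal sketch, assuming one row per consecutive date within each group and non-negative room counts (so the zero-filled dummy dates keep acting as padding):

import numpy as np

# Turn each list of per-day minima into running minima:
# element k becomes min(rooms on day d+0, ..., day d+k).
df["future_mins"] = df["future_mins"].apply(
    lambda window: list(np.minimum.accumulate(window)))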