I have the following situation:
A dataframe that records every stock movement (buy/sell) for each product and store.
date sku store Units balance
0 2019-10-01 103993.0 001 0.0 10.0
1 2019-10-02 103993.0 001 1.0 9.0
2 2019-10-04 103993.0 001 1.0 8.0
3 2019-10-05 103993.0 001 0.0 8.0
4 2019-10-01 103994.0 002 0.0 12.0
5 2019-10-02 103994.0 002 1.0 11.0
6 2019-10-04 103994.0 002 1.0 10.0
7 2019-10-05 103994.0 002 0.0 10.0
8 2019-09-30 103991.0 012 0.0 12.0
9 2019-10-02 103991.0 012 1.0 11.0
10 2019-10-04 103991.0 012 1.0 10.0
11 2019-10-05 103991.0 012 0.0 10.0
Each product has a different start date, but I want all of them to end on the same date.
Say today is 2019-10-08: I want to update this dataframe, inserting rows for the dates that were skipped between each product's first date and 2019-10-08.
Example:
Dataframe:
date sku store Units balance
0 2019-10-01 103993.0 001 0.0 10.0
1 2019-10-02 103993.0 001 1.0 9.0
2 2019-10-04 103993.0 001 1.0 8.0
3 2019-10-05 103993.0 001 0.0 8.0
The expected output should be:
date sku store Units balance
0 2019-10-01 103993.0 001 0.0 10.0
1 2019-10-02 103993.0 001 1.0 9.0
2 2019-10-03 103993.0 001 NaN NaN
3 2019-10-04 103993.0 001 1.0 8.0
4 2019-10-05 103993.0 001 0.0 8.0
5 2019-10-06 103993.0 001 NaN NaN
6 2019-10-07 103993.0 001 NaN NaN
7 2019-10-08 103993.0 001 NaN NaN
To do this, I came up with two solutions:
import pandas as pd

dfs = []
for _, d in df.groupby(['sku', 'store']):
    start_date = d.date.iloc[0]
    end_date = pd.Timestamp('2019-10-08')
    d.set_index('date', inplace=True)
    d = d.reindex(pd.date_range(start_date, end_date))
    dfs.append(d)
df = pd.concat(dfs)
And then this one:
v = '2019-10-08'
df = df.groupby(['sku', 'store'])[['date', 'Units', 'balance']] \
       .apply(lambda x: x.set_index('date')
                         .reindex(pd.date_range(x.date.iloc[0], pd.Timestamp(v))))
However, when I have a dataframe with 100,000 products, this takes far too long.
Do you have any ideas for speeding this up (vectorizing it with pandas)?
Answer 0 (score: 1)
You can do all of this with pandas merge (or join) operations. The approach can become problematic when you have many "products" (sku/store combinations) with a wide overall date range (from the dataframe's minimum date until now), because the intermediate frame holds every product/date combination: for 100,000 products over a full year, that is already about 36.5 million rows.
The following assumes your data is in df.
import datetime

import pandas as pd

# For convenience some variables:
END_DATE = datetime.date(2019, 10, 10)
product_columns = ['sku', 'store']
minimum_date = df['date'].min()
product_date_columns = product_columns + ['date']
# We will first save away the minimum date for each product for later
minimum_date_per_product = df[product_date_columns].groupby(product_columns).agg('min')
minimum_date_per_product = minimum_date_per_product.rename({'date': 'minimum_date'}, axis=1)
# Then you find all possible product/date combinations, as said above, this might lead
# to a huge dataframe (of size len(unique_products) times len(unique_dates)):
all_dates = pd.DataFrame(index=pd.date_range(minimum_date, END_DATE)).reset_index()
all_dates = all_dates.rename({'index': 'date'}, axis=1)
all_products = df[product_columns].drop_duplicates()
all_dates['key'] = 0
all_products['key'] = 0
all_product_date_combinations = pd.merge(all_dates, all_products, on='key').drop('key', axis=1)
# You then create all possible selling dates for your products
df = df.set_index(product_date_columns)
all_product_date_combinations = all_product_date_combinations.set_index(product_date_columns)
df = df.join(all_product_date_combinations, how='right')
# Now you only have to drop all rows that are before the first starting date of a product
df = df.join(minimum_date_per_product).reset_index()
df = df[df['date'] >= df['minimum_date']]
df = df.drop('minimum_date', axis=1)
For the input data you provided, the output looks like this:
sku store date Units balance
0 103991.0 12 2019-09-30 0.0 12.0
1 103991.0 12 2019-10-01 NaN NaN
2 103991.0 12 2019-10-02 1.0 11.0
3 103991.0 12 2019-10-03 NaN NaN
4 103991.0 12 2019-10-04 1.0 10.0
5 103991.0 12 2019-10-05 0.0 10.0
6 103991.0 12 2019-10-06 NaN NaN
7 103991.0 12 2019-10-07 NaN NaN
8 103991.0 12 2019-10-08 NaN NaN
9 103991.0 12 2019-10-09 NaN NaN
10 103991.0 12 2019-10-10 NaN NaN
12 103993.0 1 2019-10-01 0.0 10.0
13 103993.0 1 2019-10-02 1.0 9.0
14 103993.0 1 2019-10-03 NaN NaN
15 103993.0 1 2019-10-04 1.0 8.0
16 103993.0 1 2019-10-05 0.0 8.0
17 103993.0 1 2019-10-06 NaN NaN
18 103993.0 1 2019-10-07 NaN NaN
19 103993.0 1 2019-10-08 NaN NaN
20 103993.0 1 2019-10-09 NaN NaN
21 103993.0 1 2019-10-10 NaN NaN
23 103994.0 2 2019-10-01 0.0 12.0
24 103994.0 2 2019-10-02 1.0 11.0
25 103994.0 2 2019-10-03 NaN NaN
26 103994.0 2 2019-10-04 1.0 10.0
27 103994.0 2 2019-10-05 0.0 10.0
28 103994.0 2 2019-10-06 NaN NaN
29 103994.0 2 2019-10-07 NaN NaN
30 103994.0 2 2019-10-08 NaN NaN
31 103994.0 2 2019-10-09 NaN NaN
32 103994.0 2 2019-10-10 NaN NaN
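A side note on the cross join above: pandas 1.2 added a built-in cross merge, so the dummy 'key' column trick is no longer needed there. A minimal sketch of that variant, reusing the all_dates and all_products frames from the code above (taken before the 'key' columns are added):

import pandas as pd

# pandas >= 1.2: how='cross' builds the Cartesian product of the two frames
# directly, replacing the shared dummy 'key' column used above.
all_product_date_combinations = pd.merge(all_dates, all_products, how='cross')

Everything after that point (the right join and the minimum-date filter) stays the same.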
Answer 1 (score: 0)
If I understand you correctly, this is what you're trying to do. It might be faster, since it doesn't repeatedly concatenate and append entire dataframes, but I'm really not sure: you'd have to test it.
import numpy as np
import pandas as pd

print(df)
print("--------------")

def Insert_row(row_number, df, row_value):
    """
    Insert a row at the given position in the dataframe.
    From here: https://www.geeksforgeeks.org/insert-row-at-given-position-in-pandas-dataframe/
    """
    # Index labels for the rows above the insertion point stay as they are
    upper_half = [*range(0, row_number, 1)]
    # Index labels for the rows below the insertion point shift down by one,
    # freeing the label `row_number` for the new row
    lower_half = [x + 1 for x in range(row_number, df.shape[0], 1)]
    # Update the index of the dataframe
    df.index = upper_half + lower_half
    # Insert the new row at the freed label
    df.loc[row_number] = row_value
    # Sort the index labels so the new row lands in position
    df = df.sort_index()
    # Return the dataframe
    return df

# First ensure the column holds datetime values
df["date"] = pd.to_datetime(df["date"])

location = 1  # Start at the SECOND row
for i in range(1, df.shape[0], 1):  # Loop through all the rows
    current_date = df.iloc[location]["date"]       # Date of the current row
    previous_date = df.iloc[location - 1]["date"]  # Date of the previous row
    try:  # Try to get the difference between the rows' dates
        difference = int((current_date - previous_date) / np.timedelta64(1, 'D'))
    except ValueError as e:
        # A NaT date makes int() raise "cannot convert float NaN to integer"
        if "nan" in str(e).lower():
            location += 1  # Skip past the row with the missing date
            continue
    # print(previous_date, " - ", current_date, "=", difference)
    if difference > 1:  # If the difference is more than one day
        newdate = pd.to_datetime(previous_date) + np.timedelta64(1, "D")  # First missing date
        for d in range(1, difference, 1):  # Loop over all the missing rows
            # print("Inserting row with date {}".format(newdate))
            row_value = [newdate, np.nan, np.nan, np.nan, np.nan]  # Create the row
            df = Insert_row(location, df, row_value)  # Insert the row
            location += 1  # Increment the location
            newdate = pd.to_datetime(newdate) + np.timedelta64(1, "D")  # Increment the date for the next loop if it's needed
    location += 1

print(df)
Output:
date sku store Units balance
0 2019-10-01 103993.0 1.0 0.0 10.0
1 2019-10-02 103993.0 1.0 1.0 9.0
2 2019-10-04 103993.0 1.0 1.0 8.0
3 2019-10-05 103993.0 1.0 0.0 8.0
4 2019-10-06 103994.0 2.0 0.0 12.0
5 2019-10-07 103994.0 2.0 1.0 11.0
6 2019-10-10 103994.0 2.0 1.0 10.0
7 2019-10-15 103994.0 2.0 0.0 10.0
8 2019-10-30 103991.0 12.0 0.0 12.0
9        NaT       NaN    NaN    NaN      NaN
--------------
date sku store Units balance
0 2019-10-01 103993.0 1.0 0.0 10.0
1 2019-10-02 103993.0 1.0 1.0 9.0
2 2019-10-03 NaN NaN NaN NaN
3 2019-10-04 103993.0 1.0 1.0 8.0
4 2019-10-05 103993.0 1.0 0.0 8.0
5 2019-10-06 103994.0 2.0 0.0 12.0
6 2019-10-07 103994.0 2.0 1.0 11.0
7 2019-10-08 NaN NaN NaN NaN
8 2019-10-09 NaN NaN NaN NaN
9 2019-10-10 103994.0 2.0 1.0 10.0
10 2019-10-11 NaN NaN NaN NaN
11 2019-10-12 NaN NaN NaN NaN
12 2019-10-13 NaN NaN NaN NaN
13 2019-10-14 NaN NaN NaN NaN
14 2019-10-15 103994.0 2.0 0.0 10.0
15 2019-10-16 NaN NaN NaN NaN
16 2019-10-17 NaN NaN NaN NaN
17 2019-10-18 NaN NaN NaN NaN
18 2019-10-19 NaN NaN NaN NaN
19 2019-10-20 NaN NaN NaN NaN
20 2019-10-21 NaN NaN NaN NaN
21 2019-10-22 NaN NaN NaN NaN
22 2019-10-23 NaN NaN NaN NaN
23 2019-10-24 NaN NaN NaN NaN
24 2019-10-25 NaN NaN NaN NaN
25 2019-10-26 NaN NaN NaN NaN
26 2019-10-27 NaN NaN NaN NaN
27 2019-10-28 NaN NaN NaN NaN
28 2019-10-29 NaN NaN NaN NaN
29 2019-10-30 103991.0 12.0 0.0 12.0
30 2019-10-31 NaN NaN NaN NaN
31 2019-11-01 NaN NaN NaN NaN
32 2019-11-02 NaN NaN NaN NaN
33 2019-11-03 NaN NaN NaN NaN
34 2019-11-04 NaN NaN NaN NaN
35 2019-11-05 NaN NaN NaN NaN
36 2019-11-06 NaN NaN NaN NaN
37 2019-11-07 NaN NaN NaN NaN
38 2019-11-08 NaN NaN NaN NaN
39 2019-11-09 NaN NaN NaN NaN
40 2019-11-10 NaN NaN NaN NaN
41 2019-11-11 NaN NaN NaN NaN
42 2019-11-12 NaN NaN NaN NaN
43 2019-11-13 NaN NaN NaN NaN
44 NaT NaN NaN NaN NaN
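Since both answers leave the speed question open, the cleanest way to settle it is to time each approach on a realistically sized frame. A minimal benchmark sketch, assuming the two approaches are wrapped in functions called fill_by_merge and fill_by_insert (both names are hypothetical placeholders for the code above):

import time

def best_wall_time(fn, df, repeats=3):
    """Run fn on a fresh copy of df a few times; return the best wall-clock time."""
    best = float("inf")
    for _ in range(repeats):
        frame = df.copy()  # fresh copy so one run's in-place edits can't skew the next
        start = time.perf_counter()
        fn(frame)
        best = min(best, time.perf_counter() - start)
    return best

# fill_by_merge / fill_by_insert are hypothetical wrappers around the two answers:
# print(best_wall_time(fill_by_merge, big_df), best_wall_time(fill_by_insert, big_df))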