I have the following situation:
A dataframe that records every stock movement (buy/sell) for each product and store.
date sku store Units balance
0 2019-10-01 103993.0 001 0.0 10.0
1 2019-10-02 103993.0 001 1.0 9.0
2 2019-10-04 103993.0 001 1.0 8.0
3 2019-10-05 103993.0 001 0.0 8.0
4 2019-10-01 103994.0 002 0.0 12.0
5 2019-10-02 103994.0 002 1.0 11.0
6 2019-10-04 103994.0 002 1.0 10.0
7 2019-10-05 103994.0 002 0.0 10.0
8 2019-09-30 103991.0 012 0.0 12.0
9 2019-10-02 103991.0 012 1.0 11.0
10 2019-10-04 103991.0 012 1.0 10.0
11 2019-10-05 103991.0 012 0.0 10.0
Each product has a different start date, but I want all of them to end on the same date.
Say today is 2019-10-08: I want to update this dataframe, inserting rows for the dates that were skipped between each product's first date and 2019-10-08.
Example:
Dataframe:
date sku store Units balance
0 2019-10-01 103993.0 001 0.0 10.0
1 2019-10-02 103993.0 001 1.0 9.0
2 2019-10-04 103993.0 001 1.0 8.0
3 2019-10-05 103993.0 001 0.0 8.0
The expected output should be:
date sku store Units balance
0 2019-10-01 103993.0 001 0.0 10.0
1 2019-10-02 103993.0 001 1.0 9.0
2 2019-10-03 103993.0 001 NaN NaN
3 2019-10-04 103993.0 001 1.0 8.0
4 2019-10-05 103993.0 001 0.0 8.0
5 2019-10-06 103993.0 001 NaN NaN
6 2019-10-07 103993.0 001 NaN NaN
7 2019-10-08 103993.0 001 NaN NaN
To do this, I came up with two solutions:
import pandas as pd

dfs = []
for _, d in df.groupby(['sku', 'store']):
    start_date = d.date.iloc[0]
    end_date = pd.Timestamp('2019-10-08')
    d.set_index('date', inplace=True)
    d = d.reindex(pd.date_range(start_date, end_date))
    dfs.append(d)
df = pd.concat(dfs)
And then this one:
v = '2019-10-08'
df = df.groupby(['sku', 'store'])[['date', 'Units', 'balance']] \
       .apply(lambda x: x.set_index('date')
                         .reindex(pd.date_range(x.date.iloc[0], pd.Timestamp(v))))
However, when I have a dataframe with 100,000 products, this takes far too long.
Do you have any ideas for speeding this up (vectorizing it with pandas)?
Answer 0 (score: 1)
You can do all of this with pandas merge (or join) operations. The approach can become problematic when you have many "products" (sku/store combinations) with a wide overall date range (from the dataframe's minimum date until now), because the intermediate frame holds every product/date combination: for 100,000 products over a full year, that is already about 36.5 million rows.
The following assumes your data is in df.
import datetime

import pandas as pd

# For convenience some variables:
END_DATE = datetime.date(2019, 10, 10)
product_columns = ['sku', 'store']
minimum_date = df['date'].min()
product_date_columns = product_columns + ['date']
# We will first save away the minimum date for each product for later
minimum_date_per_product = df[product_date_columns].groupby(product_columns).agg('min')
minimum_date_per_product = minimum_date_per_product.rename({'date': 'minimum_date'}, axis=1)
# Then you find all possible product/date combinations, as said above, this might lead
# to a huge dataframe (of size len(unique_products) times len(unique_dates)):
all_dates = pd.DataFrame(index=pd.date_range(minimum_date, END_DATE)).reset_index()
all_dates = all_dates.rename({'index': 'date'}, axis=1)
all_products = df[product_columns].drop_duplicates()
all_dates['key'] = 0
all_products['key'] = 0
all_product_date_combinations = pd.merge(all_dates, all_products, on='key').drop('key', axis=1)
# You then create all possible selling dates for your products
df = df.set_index(product_date_columns)
all_product_date_combinations = all_product_date_combinations.set_index(product_date_columns)
df = df.join(all_product_date_combinations, how='right')
# Now you only have to drop all rows that are before the first starting date of a product
df = df.join(minimum_date_per_product).reset_index()
df = df[df['date'] >= df['minimum_date']]
df = df.drop('minimum_date', axis=1)
For the input data you provided, the output looks like this:
sku store date Units balance
0 103991.0 12 2019-09-30 0.0 12.0
1 103991.0 12 2019-10-01 NaN NaN
2 103991.0 12 2019-10-02 1.0 11.0
3 103991.0 12 2019-10-03 NaN NaN
4 103991.0 12 2019-10-04 1.0 10.0
5 103991.0 12 2019-10-05 0.0 10.0
6 103991.0 12 2019-10-06 NaN NaN
7 103991.0 12 2019-10-07 NaN NaN
8 103991.0 12 2019-10-08 NaN NaN
9 103991.0 12 2019-10-09 NaN NaN
10 103991.0 12 2019-10-10 NaN NaN
12 103993.0 1 2019-10-01 0.0 10.0
13 103993.0 1 2019-10-02 1.0 9.0
14 103993.0 1 2019-10-03 NaN NaN
15 103993.0 1 2019-10-04 1.0 8.0
16 103993.0 1 2019-10-05 0.0 8.0
17 103993.0 1 2019-10-06 NaN NaN
18 103993.0 1 2019-10-07 NaN NaN
19 103993.0 1 2019-10-08 NaN NaN
20 103993.0 1 2019-10-09 NaN NaN
21 103993.0 1 2019-10-10 NaN NaN
23 103994.0 2 2019-10-01 0.0 12.0
24 103994.0 2 2019-10-02 1.0 11.0
25 103994.0 2 2019-10-03 NaN NaN
26 103994.0 2 2019-10-04 1.0 10.0
27 103994.0 2 2019-10-05 0.0 10.0
28 103994.0 2 2019-10-06 NaN NaN
29 103994.0 2 2019-10-07 NaN NaN
30 103994.0 2 2019-10-08 NaN NaN
31 103994.0 2 2019-10-09 NaN NaN
32 103994.0 2 2019-10-10 NaN NaN
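A side note on the cross join above: pandas 1.2 added a built-in cross merge, so the dummy 'key' column trick is no longer needed there. A minimal sketch of that variant, reusing the all_dates and all_products frames from the code above (taken before the 'key' columns are added):

import pandas as pd

# pandas >= 1.2: how='cross' builds the Cartesian product of the two frames
# directly, replacing the shared dummy 'key' column used above.
all_product_date_combinations = pd.merge(all_dates, all_products, how='cross')

Everything after that point (the right join and the minimum-date filter) stays the same.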
Answer 1 (score: 0)
If I understand you correctly, this is what you're trying to do. It might be faster, since it doesn't repeatedly concatenate and append entire dataframes, but I'm really not sure: you'd have to test it.
import numpy as np
import pandas as pd

print(df)
print("--------------")

def Insert_row(row_number, df, row_value):
    """
    Insert a row at the given position in the dataframe.
    From here: https://www.geeksforgeeks.org/insert-row-at-given-position-in-pandas-dataframe/
    """
    # Index labels for the rows above the insertion point stay as they are
    upper_half = [*range(0, row_number, 1)]
    # Index labels for the rows below the insertion point shift down by one,
    # freeing the label `row_number` for the new row
    lower_half = [x + 1 for x in range(row_number, df.shape[0], 1)]
    # Update the index of the dataframe
    df.index = upper_half + lower_half
    # Insert the new row at the freed label
    df.loc[row_number] = row_value
    # Sort the index labels so the new row lands in position
    df = df.sort_index()
    # Return the dataframe
    return df

# First ensure the column holds datetime values
df["date"] = pd.to_datetime(df["date"])

location = 1  # Start at the SECOND row
for i in range(1, df.shape[0], 1):  # Loop through all the rows
    current_date = df.iloc[location]["date"]       # Date of the current row
    previous_date = df.iloc[location - 1]["date"]  # Date of the previous row
    try:  # Try to get the difference between the rows' dates
        difference = int((current_date - previous_date) / np.timedelta64(1, 'D'))
    except ValueError as e:
        # A NaT date makes int() raise "cannot convert float NaN to integer"
        if "nan" in str(e).lower():
            location += 1  # Skip past the row with the missing date
            continue
    # print(previous_date, " - ", current_date, "=", difference)
    if difference > 1:  # If the difference is more than one day
        newdate = pd.to_datetime(previous_date) + np.timedelta64(1, "D")  # First missing date
        for d in range(1, difference, 1):  # Loop over all the missing rows
            # print("Inserting row with date {}".format(newdate))
            row_value = [newdate, np.nan, np.nan, np.nan, np.nan]  # Create the row
            df = Insert_row(location, df, row_value)  # Insert the row
            location += 1  # Increment the location
            newdate = pd.to_datetime(newdate) + np.timedelta64(1, "D")  # Increment the date for the next loop if it's needed
    location += 1

print(df)
Output:
date sku store Units balance
0 2019-10-01 103993.0 1.0 0.0 10.0
1 2019-10-02 103993.0 1.0 1.0 9.0
2 2019-10-04 103993.0 1.0 1.0 8.0
3 2019-10-05 103993.0 1.0 0.0 8.0
4 2019-10-06 103994.0 2.0 0.0 12.0
5 2019-10-07 103994.0 2.0 1.0 11.0
6 2019-10-10 103994.0 2.0 1.0 10.0
7 2019-10-15 103994.0 2.0 0.0 10.0
8 2019-10-30 103991.0 12.0 0.0 12.0
9        NaT       NaN    NaN    NaN      NaN
--------------
date sku store Units balance
0 2019-10-01 103993.0 1.0 0.0 10.0
1 2019-10-02 103993.0 1.0 1.0 9.0
2 2019-10-03 NaN NaN NaN NaN
3 2019-10-04 103993.0 1.0 1.0 8.0
4 2019-10-05 103993.0 1.0 0.0 8.0
5 2019-10-06 103994.0 2.0 0.0 12.0
6 2019-10-07 103994.0 2.0 1.0 11.0
7 2019-10-08 NaN NaN NaN NaN
8 2019-10-09 NaN NaN NaN NaN
9 2019-10-10 103994.0 2.0 1.0 10.0
10 2019-10-11 NaN NaN NaN NaN
11 2019-10-12 NaN NaN NaN NaN
12 2019-10-13 NaN NaN NaN NaN
13 2019-10-14 NaN NaN NaN NaN
14 2019-10-15 103994.0 2.0 0.0 10.0
15 2019-10-16 NaN NaN NaN NaN
16 2019-10-17 NaN NaN NaN NaN
17 2019-10-18 NaN NaN NaN NaN
18 2019-10-19 NaN NaN NaN NaN
19 2019-10-20 NaN NaN NaN NaN
20 2019-10-21 NaN NaN NaN NaN
21 2019-10-22 NaN NaN NaN NaN
22 2019-10-23 NaN NaN NaN NaN
23 2019-10-24 NaN NaN NaN NaN
24 2019-10-25 NaN NaN NaN NaN
25 2019-10-26 NaN NaN NaN NaN
26 2019-10-27 NaN NaN NaN NaN
27 2019-10-28 NaN NaN NaN NaN
28 2019-10-29 NaN NaN NaN NaN
29 2019-10-30 103991.0 12.0 0.0 12.0
30 2019-10-31 NaN NaN NaN NaN
31 2019-11-01 NaN NaN NaN NaN
32 2019-11-02 NaN NaN NaN NaN
33 2019-11-03 NaN NaN NaN NaN
34 2019-11-04 NaN NaN NaN NaN
35 2019-11-05 NaN NaN NaN NaN
36 2019-11-06 NaN NaN NaN NaN
37 2019-11-07 NaN NaN NaN NaN
38 2019-11-08 NaN NaN NaN NaN
39 2019-11-09 NaN NaN NaN NaN
40 2019-11-10 NaN NaN NaN NaN
41 2019-11-11 NaN NaN NaN NaN
42 2019-11-12 NaN NaN NaN NaN
43 2019-11-13 NaN NaN NaN NaN
44 NaT NaN NaN NaN NaN
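Since both answers leave the speed question open, the cleanest way to settle it is to time each approach on a realistically sized frame. A minimal benchmark sketch, assuming the two approaches are wrapped in functions called fill_by_merge and fill_by_insert (both names are hypothetical placeholders for the code above):

import time

def best_wall_time(fn, df, repeats=3):
    """Run fn on a fresh copy of df a few times; return the best wall-clock time."""
    best = float("inf")
    for _ in range(repeats):
        frame = df.copy()  # fresh copy so one run's in-place edits can't skew the next
        start = time.perf_counter()
        fn(frame)
        best = min(best, time.perf_counter() - start)
    return best

# fill_by_merge / fill_by_insert are hypothetical wrappers around the two answers:
# print(best_wall_time(fill_by_merge, big_df), best_wall_time(fill_by_insert, big_df))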