Python数据框循环操作效率

时间:2018-11-08 14:56:12

标签: python pandas performance loops dataframe

我有两个输入,一个数据框包含所有产品的现有库存,一个数据框列表,其中每个数据框包含多个产品的订单。订单数据框可以包含同一产品的多个订单,并且这些订单可以大于现有的可用库存。

样本输入数据:

包含订单的数据框列表

[    prodID    orderQTY
 0   ABC1      456
 1   ABC2      703
 2   ABC3      359
 3   ABC4      492
 4   ABC5      824
 ..  ...       ...
 [100 rows x 2 columns],
     prodID    orderQTY
 0   ABC10     805
 1   ABC11     860
 2   ABC12     651
 3   ABC13     662
 4   ABC14     802
 ..  ...       ...
 [100 rows x 2 columns],
     prodID    orderQTY
 0   ABC100    111
 1   ABC101    834
 2   ABC102    302
 3   ABC103    386
 4   ABC104    318
 ..  ...       ...
 [100 rows x 2 columns]]

现有库存的数据框

    prodID    onHand
0   ABC1      37200
1   ABC2      38945
2   ABC3      38085
3   ABC4      43775
4   ABC5      10999
..  ...       ...
 [300 rows x 2 columns]

我编写了一个函数来遍历列表中每个数据框内的订单并更新现有库存。然后,该函数将输出包含所有订单的数据框,其中显示了开始的库存,订单金额和结束的库存。

功能:

def fulfill_orders(order_list, stock):
# create empty dataframe where the results will go
combined = pd.DataFrame(columns=['prodID', 'startSTOCK', 'orderQTY', 'endSTOCK'])

# loop through dataframes in order_list
for orders in order_list:
    # create unique list of product ids
    ids = orders['prodID'].unique()
    # loop through unique list of product ids
    for id in ids:
        # create dataframe that consists of a single product id
        df = orders.groupby('prodID').get_group(id)
        # merge stock on hand onto single product id dataframe
        df = df.merge(stock, how='left', on='prodID')
        # rename stock on hand column
        df.rename(columns={'onHAND': 'startSTOCK'}, inplace=True)
        # create ending stock column and calculate value for first row
        df.loc[0,'endSTOCK'] = df.loc[0, 'startSTOCK'] - df.loc[0, 'orderQTY']
        # check if ending stock for first row is less than zero
        if df.loc[0, 'endSTOCK'] < 0:
            # reset negative first row ending stock to starting stock
            df.loc[0, 'endSTOCK'] = df.loc[0, 'startSTOCK']
        # loop through remaining rows
        for i in range(1, len(df)):
            # set starting stock to previous ending stock
            df.loc[i, 'startSTOCK'] = df.loc[i-1, 'endSTOCK']
            # calculate new ending stock
            df.loc[i, 'endSTOCK'] = df.loc[i, 'startSTOCK'] - df.loc[i, 'orderQTY']
            # check if ending stock is less than zero
            if df.loc[i, 'endSTOCK'] < 0:
                # reset negative ending stock to starting stock
                df.loc[i, 'endSTOCK'] = df.loc[i, 'startSTOCK']
        # append results to output dataframe
        combined = combined.append(df, ignore_index=True)
        # remove extra columns
        df = df[['prodID', 'endSTOCK']]
        # rename columns
        df.rename(columns={'endSTOCK': 'onHAND'}, inplace=True)
        # keep only last value
        df = df.iloc[[-1]]
        # update the stock on hand dataframe
        stock = pd.concat([stock, df]).drop_duplicates(['prodID'], keep='last')

# change column types to integer
combined['startSTOCK'] = combined.startSTOCK.astype(int)
combined['endSTOCK'] = combined.endSTOCK.astype(int)

# reorder columns
combined = combined[['prodID', 'startSTOCK', 'orderQTY', 'endSTOCK']]

# reset index
combined.reset_index(drop=True, inplace=True)

# drop orders not fulfilled
combined.drop(combined[(combined.startSTOCK <= 0) | (combined.orderQTY > combined.startSTOCK)].index, inplace=True)

# reset index
combined.reset_index(drop=True, inplace=True)

return combined.sort_values(['prodID', 'startSTOCK'], ascending=[True, False])

示例输出:

    prodID    startSTOCK    orderQTY    endSTOCK
0   ABC1      37200         456         36744
1   ABC2      38945         703         38242
2   ABC3      38085         359         37726
3   ABC4      43775         492         43283
4   ABC5      10999         824         10175
..  ...       ...           ...         ...
[300 rows x 4 columns]

我担心的是,对于较大的输入数据,此功能可能效率不高。有没有更有效的方法来达到我想要的结果?

1 个答案:

答案 0 :(得分:0)

设置

df_order

输出

  prodID  orderQTY
0   ABC1       456
1   ABC2       703
2   ABC3       359
3   ABC4       492
4   ABC5       824

将DataFrames(dfs)的列表作为您重复显示三次的第一个DataFrame的第一部分。

dfs = [df_order] * 3
df_stock = df_stock.set_index('prodID')

过程

df_orders = pd.concat(dfs).groupby('prodID').sum()
df_all = df_stock.join(df_orders)
df_all['endSTOCK'] = df_all['onHand'] - df_all['orderQTY']
df_all

返回

        onHand  orderQTY  endSTOCK
prodID
ABC1     37200      1368     35832
ABC2     38945      2109     36836
ABC3     38085      1077     37008
ABC4     43775      1476     42299
ABC5     10999      2472      8527