Question

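I need to merge two Pandas dataframes. The first is a long-format dataset containing my selling prices for items at various quantity breaks. The price decreases as the number of parts purchased increases.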

Dataframe1

PART#    MY_QTY   MY_PRC
Item1    1        $20
Item1    10       $18
Item1    20       $17
Item2    1        $120
Item2    30       $100
Item2    50       $95

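The second is a wide-format dataset containing the quantity breaks and selling prices of multiple vendors. For Item1 below, if I buy 1 piece from Vend1 I pay $10; at 4 pieces the price is still $10, at 5 pieces it is $8, and so on. The number of quantity breaks varies by item and vendor, and not all vendors sell all items.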

Dataframe2

PART#    VEND#   QTY1  PRC1   QTY2   PRC2   QTY3   PRC3
Item1    Vend1    1    $10     5     $8      15    $7
Item1    Vend2    1    $15     11    $12     30    $11
Item1    Vend3    1    $20     10    $18
Item2    Vend1    1    $75     20    $60     30    $55
Item2    Vend2    1    $80     12    $70

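I want to merge the dataframes so that I can compare my selling price at each quantity break with the vendors' costs at the same quantity. The final dataframe would have the shape of a left merge on PART#, with VEND# pivoted into columns.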

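The part I am struggling with is getting the correct vendor price based on MY_QTY. I should be able to read across a single row and see what every party charges for an item at a given quantity. The expected output is below.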

Resulting dataframe

PART#    MY_QTY   MY_PRC    VEND1    VEND2    VEND3
Item1    1        $20       $10      $15      $20
Item1    10       $18       $8       $15      $18
Item1    20       $17       $7       $12      $18
Item2    1        $120      $75      $80
Item2    30       $100      $55      $70
Item2    50       $95       $55      $70

Edit

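People seem to be confused about Dataframe2. That dataframe is read row by row. The first row shows the prices for Item1 as sold by Vend1. In this row, from QTY1 (1 piece) up to but not including QTY2 (5 pieces), the price is PRC1 ($10); from QTY2 (5 pieces) up to but not including QTY3 (15 pieces), the price is PRC2 ($8). The price stays the same until the requested quantity reaches the next quantity break.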

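Say Mama's farm stand sells apples for $1 each. If you buy 5 apples, the price drops to $0.75 per apple. If you buy 15 apples, the price drops again to $0.50. The dataframe for this example would look like this.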

PART#    VEND#   QTY1  PRC1   QTY2   PRC2   QTY3   PRC3
Apple    Mama    1     $1     5      $.75   15     $.5
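
The lookup rule described above can be sketched in plain Python before bringing Pandas into it. This is only an illustration of the rule, not part of either dataframe: the names `breaks`, `prices` and `unit_price` are made up for the example, and the values are copied from the apple row.

```python
from bisect import bisect_right

# Quantity breaks and unit prices copied from the apple row above.
breaks = [1, 5, 15]
prices = [1.00, 0.75, 0.50]

def unit_price(quantity):
    """Return the price at the largest break that is <= quantity.

    Returns None when the quantity is below the first break.
    """
    # bisect_right counts how many breaks are <= quantity
    pos = bisect_right(breaks, quantity)
    return prices[pos - 1] if pos else None

for qty in (1, 4, 5, 14, 15, 100):
    print(qty, unit_price(qty))
```

Buying 4 apples still costs $1 each, while 5 through 14 apples cost $0.75 each, which is the behaviour I need for every vendor row.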

Answer 1

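Here is a working example of how you can do this. It is by no means efficient. Others seem to be trying to join the two datasets, but it sounds like what you actually want is, for each vendor/part combination, the price at the largest QTY <= MY_QTY.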

import pandas as pd
from io import StringIO
import numpy as np

df1_t = StringIO("""PART#,MY_QTY,MY_PRC
Item1,1,$20
Item1,10,$18
Item1,20,$17
Item2,1,$120
Item2,30,$100
Item2,50,$95
""")

df2_t = StringIO("""PART#,VEND#,QTY1,PRC1,QTY2,PRC2,QTY3,PRC3
Item1,Vend1,1,$10,5,$8,15,$7
Item1,Vend2,1,$15,11,$12,30,$11
Item1,Vend3,1,$20,10,$18
Item2,Vend1,1,$75,20,$60,30,$55
Item2,Vend2,1,$80,12,$70
""")

df1 = pd.read_csv(df1_t)
df2 = pd.read_csv(df2_t)

vendors = df2['VEND#'].unique()
items = df2['PART#'].unique()

# for the specific item and vendor in the rows of Dataframe1 (df1), find the
# largest QTY that is less than or equal to MY_QTY for the same combination of
# item and vendor in df2, and return the price at that break
def find_price(row, vendor, df2):
    item = row['PART#']
    quantity = row['MY_QTY']
    # get the row with that specific item / vendor combo
    prices = df2[(df2['PART#']==item) & (df2['VEND#']==vendor)]
    # reshape a little
    prices = pd.wide_to_long(prices, ['QTY','PRC'], i='VEND#', j='v').set_index('QTY',append=True).reset_index().drop('v',axis=1)
    # only get where QTY <= MY_QTY
    prices = prices[prices['QTY']<=quantity]
    if prices.empty:
        return np.nan
    else:
        # idxmax returns the index label of the largest QTY (argmax returns a position)
        return prices.loc[prices['QTY'].idxmax(), 'PRC']


# iterate through the vendors, and use find_price to get the corresponding price
for vendor in vendors:
    df1[vendor] = df1.apply(lambda row: find_price(row, vendor, df2),axis=1)

print(df1)
#   PART#  MY_QTY MY_PRC Vend1 Vend2 Vend3
#0  Item1       1    $20   $10   $15   $20
#1  Item1      10    $18    $8   $15   $18
#2  Item1      20    $17    $7   $12   $18
#3  Item2       1   $120   $75   $80   NaN
#4  Item2      30   $100   $55   $70   NaN
#5  Item2      50    $95   $55   $70   NaN
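
If the row-by-row `apply` turns out to be too slow, the same rule (largest QTY <= MY_QTY per part and vendor) can probably be expressed without a Python-level loop. The following is a rough sketch under that assumption, not a tuned solution; the intermediate names `long2`, `pairs`, `wide`, `result` and the helper column `brk` are invented for this example.

```python
import pandas as pd
from io import StringIO

df1 = pd.read_csv(StringIO("""PART#,MY_QTY,MY_PRC
Item1,1,$20
Item1,10,$18
Item1,20,$17
Item2,1,$120
Item2,30,$100
Item2,50,$95
"""))

df2 = pd.read_csv(StringIO("""PART#,VEND#,QTY1,PRC1,QTY2,PRC2,QTY3,PRC3
Item1,Vend1,1,$10,5,$8,15,$7
Item1,Vend2,1,$15,11,$12,30,$11
Item1,Vend3,1,$20,10,$18
Item2,Vend1,1,$75,20,$60,30,$55
Item2,Vend2,1,$80,12,$70
"""))

# wide -> long: one row per (part, vendor, quantity break); drop unused breaks
long2 = (pd.wide_to_long(df2, ['QTY', 'PRC'], i=['PART#', 'VEND#'], j='brk')
           .reset_index()
           .dropna(subset=['QTY']))

# pair every df1 row with every break of the same part,
# then keep only the breaks that the requested quantity has reached
pairs = df1.merge(long2, on='PART#')
pairs = pairs[pairs['QTY'] <= pairs['MY_QTY']]

# within each (part, my quantity, vendor) keep the largest remaining break
pairs = (pairs.sort_values('QTY')
              .drop_duplicates(['PART#', 'MY_QTY', 'VEND#'], keep='last'))

# move the vendors into columns and attach them to df1 with a left merge
wide = (pairs.set_index(['PART#', 'MY_QTY', 'VEND#'])['PRC']
             .unstack('VEND#')
             .reset_index())
result = df1.merge(wide, on=['PART#', 'MY_QTY'], how='left')
print(result)
```

The intermediate `pairs` frame holds one row per df1 row per vendor break, so this trades memory for speed; for very large catalogues that cross product may itself become the bottleneck.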

Answer 2

Here is another approach that loops only over the vendors, but requires the data to be sorted.

import pandas as pd
from io import StringIO
import numpy as np

df1_t = StringIO("""PART#,MY_QTY,MY_PRC
Item1,1,$20
Item1,10,$18
Item1,20,$17
Item2,1,$120
Item2,30,$100
Item2,50,$95
""")

df2_t = StringIO("""PART#,VEND#,QTY1,PRC1,QTY2,PRC2,QTY3,PRC3
Item1,Vend1,1,$10,5,$8,15,$7
Item1,Vend2,1,$15,11,$12,30,$11
Item1,Vend3,1,$20,10,$18
Item2,Vend1,1,$75,20,$60,30,$55
Item2,Vend2,1,$80,12,$70
""")

df1 = pd.read_csv(df1_t)
df2 = pd.read_csv(df2_t)


# reshape df2 to long format: one row per (part, vendor, quantity break);
# PART# and VEND# together identify a row, VEND# alone is not unique
df2 = pd.wide_to_long(df2, ['QTY','PRC'], i=['PART#','VEND#'], j='v').reset_index().drop('v', axis=1)
# merge_asof needs both keys to have the same dtype and to be sorted
df1['MY_QTY'] = df1['MY_QTY'].astype(float)
df1 = df1.sort_values(by="MY_QTY")
df2 = df2.dropna(axis=0, how='any')
df2 = df2.sort_values(by="QTY")

vendors = sorted(df2['VEND#'].unique())
df3 = df1
for vendor in vendors:
    # price breaks of this vendor only, with the price column named after the vendor
    right = df2.loc[df2['VEND#']==vendor, ['PART#','QTY','PRC']].rename(columns={'PRC': vendor})
    # for each df1 row, take the last break with QTY <= MY_QTY for the same part
    df3 = pd.merge_asof(df3, right, left_on="MY_QTY", right_on="QTY", by='PART#').drop('QTY', axis=1)

print(df3)
#     PART#  MY_QTY MY_PRC Vend1 Vend2 Vend3
#0  Item1     1.0    $20   $10   $15   $20
#1  Item2     1.0   $120   $75   $80   NaN
#2  Item1    10.0    $18    $8   $15   $18
#3  Item1    20.0    $17    $7   $12   $18
#4  Item2    30.0   $100   $55   $70   NaN
#5  Item2    50.0    $95   $55   $70   NaN
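
To see what `merge_asof` is doing in isolation, here is a minimal sketch on the apple example from the question. The frames `breaks` and `orders` and the quantities 3, 5 and 40 are made up for the illustration. With the default `direction='backward'`, each left row is matched to the last right row whose key is less than or equal to the left key, which is why both frames have to be sorted on their keys and share a dtype.

```python
import pandas as pd

# quantity breaks for the apple example; floats to match MY_QTY below
breaks = pd.DataFrame({'QTY': [1.0, 5.0, 15.0],
                       'PRC': [1.00, 0.75, 0.50]})

# hypothetical requested quantities, already sorted ascending
orders = pd.DataFrame({'MY_QTY': [3.0, 5.0, 40.0]})

# backward as-of join: last break with QTY <= MY_QTY
matched = pd.merge_asof(orders, breaks, left_on='MY_QTY', right_on='QTY')
print(matched)
```

A quantity of 3 falls back to the break at 1, a quantity of 5 hits the break at 5 exactly, and 40 stays on the last break at 15.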

Answer 3

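import pandas as pd

# assumes df1 and df2 are the original dataframes loaded as in the answers above
dfs = []
for val in ['PRC1', 'PRC2', 'PRC3']:
    # one pivot per price column: parts as rows, vendors as columns;
    # aggfunc='first' because the prices are strings such as "$10"
    temp = pd.pivot_table(df2, index='PART#', columns='VEND#', values=val, aggfunc='first').reset_index()
    dfs.append(temp)
# stack the three pivots and order them by part
pivot = pd.concat(dfs, axis=0)
pivot.sort_values('PART#', inplace=True)
pivot.reset_index(inplace=True)
# join on the row index; lsuffix renames the overlapping PART# column of df1
df1.join(pivot, lsuffix='PART#')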

Pandas: merging, scaling, and pivoting long-format and wide-format dataframes

3 answers: