Update rows of a pandas DataFrame based on other rows

Time: 2019-01-16 13:43:34

Tags: python pandas

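I have a pandas DataFrame with the columns pk1, pk2, type, qty_6 and qty_7. The type column takes the values predicted_90, override_90, predicted_50 and override_50. For each combination of pk1 and pk2, if the override_50 or override_90 row holds a non-NaN value, I want to replace the corresponding value in the predicted_50 or predicted_90 row with it. I also want to record the change in boolean columns named qty_6_overridden and qty_7_overridden, and capture the difference between the two values in the columns qty_6_dev and qty_7_dev, where qty_6_dev = qty_6 override - qty_6 predicted.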
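To make the rule concrete, here is a minimal sketch on two already-aligned values per series. The names `predicted`, `override`, `overridden`, `updated` and `dev` are illustrative only and assume the override value has been lined up with the predicted value for the same (pk1, pk2) key.

```python
import numpy as np
import pandas as pd

# Two predicted quantities and the override values aligned to them
# (NaN means "no override supplied").
predicted = pd.Series([2207.931, 1911.205])
override = pd.Series([1973.000, np.nan])

# A value counts as overridden only if an override exists and differs.
overridden = override.notna() & override.ne(predicted)

# Take the override where flagged, otherwise keep the prediction.
updated = predicted.mask(overridden, override)

# Deviation = override - predicted; it is 0 where nothing changed.
dev = updated - predicted

print(pd.DataFrame({"updated": updated, "dev": dev, "overridden": overridden}))
```

The first value is replaced by 1973.0 with a deviation of about -234.931, while the second keeps its prediction with a deviation of 0.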

Sample DataFrame:

data=[
['B01FV0FBX4','2019-01-13','predicted_90',2207.931,2217.841],
['B01FV0FBX4','2019-01-13','predicted_50',1561.033,1521.567],
['B01FV0FBX4','2019-01-13','override_90',1973.000,np.nan],
['B01FV0FBX4','2019-01-13','override_50',1233.000,np.nan],
['B01FV0FBX4','2019-01-06','override_50',np.nan,1233.000],
['B01FV0FBX4','2019-01-06','predicted_50',1210.129,1213.803],
['B01FV0FBX4','2019-01-06','override_90',np.nan,1973.000],
['B01FV0FBX4','2019-01-06','predicted_90',1911.205,1921.594]
]

df = pd.DataFrame(data,columns=['pk1','pk2', 'type', 'qty_6', 'qty_7'])

Expected output:

data=[
['B01FV0FBX4','2019-01-13','predicted_90',1973.000,2217.841,-234.931,0,True,False],
['B01FV0FBX4','2019-01-13','predicted_50',1233.000,1521.567,-328.033,0,True,False],
['B01FV0FBX4','2019-01-13','override_90',1973.000,np.nan,0,0,False,False],
['B01FV0FBX4','2019-01-13','override_50',1233.000,np.nan,0,0,False,False],
['B01FV0FBX4','2019-01-06','override_50',np.nan,1233.000,0,0,False,False],
['B01FV0FBX4','2019-01-06','predicted_50',1210.129,1213.000,0,-0.803,False,True],
['B01FV0FBX4','2019-01-06','override_90',np.nan,1973.000,0,0,False,False],
['B01FV0FBX4','2019-01-06','predicted_90',1911.205,1973.000,0,51.406,False,True]
]
df = pd.DataFrame(data,columns=['pk1','pk2', 'type', 'qty_6', 'qty_7','qty_6_dev','qty_7_dev', 'qty_6_overridden','qty_7_overridden'])

As the example shows, the override quantities replace the predicted quantities, and we get the corresponding columns 'qty_6_dev', 'qty_7_dev', 'qty_6_overridden' and 'qty_7_overridden'.

I was able to write a solution. It works, but it looks horrible and is hard for others to understand.

import pandas as pd
import numpy as np
import math

data=[
['B01FV0FBX4','2019-01-13','predicted_90',2207.931,2217.841],
['B01FV0FBX4','2019-01-13','predicted_50',1561.033,1521.567],
['B01FV0FBX4','2019-01-13','override_90',1973.000,np.nan],
['B01FV0FBX4','2019-01-13','override_50',1233.000,np.nan],
['B01FV0FBX4','2019-01-06','override_50',np.nan,1233.000],
['B01FV0FBX4','2019-01-06','predicted_50',1210.129,1213.803],
['B01FV0FBX4','2019-01-06','override_90',np.nan,1973.000],
['B01FV0FBX4','2019-01-06','predicted_90',1911.205,1921.594]
]
df = pd.DataFrame(data,columns=['pk1','pk2', 'type', 'qty_6', 'qty_7'])

override_map = {
    "predicted_50" : "override_50",
    "predicted_90" : "override_90"
}

def transform_df(df):
    # DataFrame.append was removed in pandas 2.0, so collect rows in a list
    rows = []
    for index, row in df.iterrows():
        row_type = row['type']
        row_pk1 = row['pk1']
        row_pk2 = row['pk2']

        if row_type in override_map:
            override_type = override_map[row_type]
        else:
            # Override rows are never changed themselves
            for i in range(6,8):
                qty_dev_col = 'qty_'+str(i)+'_dev'
                qty_override_col = 'qty_'+str(i)+'_overridden'
                row[qty_dev_col] = 0
                row[qty_override_col] = False
            rows.append(row)
            continue
        # Matching override row for the same (pk1, pk2)
        corr_df = df.loc[(df.type == override_type)
                         & (df.pk1 == row_pk1)
                         & (df.pk2 == row_pk2)]

        for i in range(6,8):
            qty_col = 'qty_'+str(i)
            qty_dev_col = 'qty_'+str(i)+'_dev'
            qty_override_col = 'qty_'+str(i)+'_overridden'
            # Take the scalar override value; NaN if no override row exists
            override_qty = corr_df[qty_col].values[0] if len(corr_df) else np.nan
            if not math.isnan(override_qty) and override_qty != row[qty_col]:
                row[qty_dev_col] = override_qty - row[qty_col]
                row[qty_col] = override_qty
                row[qty_override_col] = True
            else:
                row[qty_dev_col] = 0
                row[qty_override_col] = False
        rows.append(row)
    return pd.DataFrame(rows).reset_index(drop=True)

x1 = transform_df(df)

Is there a better way to do this, using a lambda or some other approach? Also, this takes forever to run on a larger DataFrame.
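One possible direction, offered as a sketch rather than a definitive answer, is to avoid the row loop entirely: relabel each override row with the predicted type it applies to, left-merge it back onto the original rows, and compute all columns with array operations. The helper name `apply_overrides` and the `qty_cols` list are hypothetical. The sketch assumes the quantity columns are numeric and that there is at most one override row per (pk1, pk2, type); `validate='many_to_one'` makes the merge raise if that does not hold.

```python
import numpy as np
import pandas as pd

data = [
    ['B01FV0FBX4', '2019-01-13', 'predicted_90', 2207.931, 2217.841],
    ['B01FV0FBX4', '2019-01-13', 'predicted_50', 1561.033, 1521.567],
    ['B01FV0FBX4', '2019-01-13', 'override_90', 1973.000, np.nan],
    ['B01FV0FBX4', '2019-01-13', 'override_50', 1233.000, np.nan],
    ['B01FV0FBX4', '2019-01-06', 'override_50', np.nan, 1233.000],
    ['B01FV0FBX4', '2019-01-06', 'predicted_50', 1210.129, 1213.803],
    ['B01FV0FBX4', '2019-01-06', 'override_90', np.nan, 1973.000],
    ['B01FV0FBX4', '2019-01-06', 'predicted_90', 1911.205, 1921.594],
]
df = pd.DataFrame(data, columns=['pk1', 'pk2', 'type', 'qty_6', 'qty_7'])

override_map = {'predicted_50': 'override_50', 'predicted_90': 'override_90'}
qty_cols = ['qty_6', 'qty_7']  # hypothetical: list of quantity columns to process


def apply_overrides(df, override_map, qty_cols):
    """Vectorised sketch: swap in non-NaN override values per (pk1, pk2)."""
    keys = ['pk1', 'pk2', 'type']
    out = df.copy()

    # Relabel override rows with the predicted type they should replace.
    inverse = {ov: pred for pred, ov in override_map.items()}
    ov = df.loc[df['type'].isin(list(inverse)), keys + qty_cols].copy()
    ov['type'] = ov['type'].map(inverse)

    # Left merge keeps one row per original row, in the original order.
    # Rows without a matching override (including the override rows
    # themselves) get NaN in the quantity columns.
    aligned = out[keys].merge(ov, on=keys, how='left', validate='many_to_one')

    for col in qty_cols:
        override = aligned[col].to_numpy()
        predicted = out[col].to_numpy()
        changed = ~np.isnan(override) & (override != predicted)
        out[col + '_dev'] = np.where(changed, override - predicted, 0.0)
        out[col] = np.where(changed, override, predicted)
        out[col + '_overridden'] = changed
    return out


result = apply_overrides(df, override_map, qty_cols)
print(result)
```

Because the work is done by one merge and a few array operations per quantity column, the cost should grow far more gently with the number of rows than the `iterrows` loop. On the sample input, the 2019-01-06 predicted_50 row ends up with qty_7 = 1233.000 and qty_7_dev of about 19.197, since the override_50 value there is 1233.000; this differs from the 1213.000 and -0.803 shown in the expected output, while the other overridden rows match it.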

0 answers:

No answers