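DataFrame column based on 4 conditions, nested np.where

Time: 2018-03-20 19:17:02

Tags: python pandas numpy dataframe

The DataFrame I am working with has 2 columns with 4 possible combinations, repeated across hundreds of groups.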

| Group |   Before   |    After   |
|:-----:|:----------:|:----------:|
|   G1  |  Injection |  Injection |
|   G1  |  Injection | Production |
|   G1  | Production |  Injection |
|   G1  | Production | Production |

There are 3 precomputed columns that need to be pulled in depending on the Before/After combination, as shown below.

| Group |   Before   |    After   |         Output         |
|:-----:|:----------:|:----------:|:----------------------:|
|   G1  |  Injection |  Injection |        df['DTI']       |
|   G1  |  Injection | Production | df['DTWF'] + df['DTP'] |
|   G1  | Production |  Injection | df['DTWF'] + df['DTI'] |
|   G1  | Production | Production |        df['DTP']       |

I tried nesting multiple np.where calls:

np.where(df['Before'] == 'Injection' & df['After'] == 'Injection', df['DTI'],
np.where(....))

which results in:

ValueError: either both or neither of x and y should be given

and nesting multiple np.logical_and calls:

np.where(np.logical_and(df['Before'] == 'Injection' & df['After'] == 'Injection'), df['DTP'])

which results in:

The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I have hit the limit of what I can do and need some ideas!

2 Answers:

Answer 0: (score: 0)

One approach is to use the apply function.

Assuming your DataFrame is in the variable df, you can do the following:

import pandas as pd

df = pd.DataFrame(data={"Before": ["Injection", "Injection", "Production", "Production"],
                        "After": ["Injection", "Production", "Injection", "Production"]})
def get_output(x):
    if x['Before'] == 'Injection' and x['After'] == 'Injection':
        return 'DTI'
    elif x['Before'] == 'Injection' and x['After'] == 'Production':
        return 'DTWF + DTP'
    elif x['Before'] == 'Production' and x['After'] == 'Injection':
        return 'DTWF + DTI'
    elif x['Before'] == 'Production' and x['After'] == 'Production':
        return 'DTP'

df['Output'] = df.apply(get_output, axis=1)
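
The function above returns text labels rather than the column values the question asks for. A minimal sketch of the same row-wise idea that pulls the actual values might look like this; the DTI, DTP, and DTWF numbers below are made up for illustration, and pick_value is a hypothetical helper name:

```python
import pandas as pd

# Hypothetical sample data: the DTI/DTP/DTWF values are invented for illustration.
df = pd.DataFrame({
    "Before": ["Injection", "Injection", "Production", "Production"],
    "After": ["Injection", "Production", "Injection", "Production"],
    "DTI": [1.0, 2.0, 3.0, 4.0],
    "DTP": [10.0, 20.0, 30.0, 40.0],
    "DTWF": [100.0, 200.0, 300.0, 400.0],
})

def pick_value(row):
    """Return the output value for one row based on its Before/After pair."""
    before, after = row["Before"], row["After"]
    if before == "Injection" and after == "Injection":
        return row["DTI"]
    if before == "Injection" and after == "Production":
        return row["DTWF"] + row["DTP"]
    if before == "Production" and after == "Injection":
        return row["DTWF"] + row["DTI"]
    return row["DTP"]  # remaining case: Production -> Production

# axis=1 passes each row to pick_value as a Series
df["Output"] = df.apply(pick_value, axis=1)
```

Because apply with axis=1 calls a Python function once per row, this is likely to be noticeably slower than a vectorized approach on large frames.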

Answer 1: (score: 0)

Before["Injection"] does not do what you think it does. In the code you show, it is not even defined.

What you probably want is:

import pandas as pd

# df definition, skipping Group because it is not needed here
df = pd.DataFrame(data={"Before": ["Injection", "Injection", "Production", "Production"], "After": ["Injection", "Production", "Injection", "Production"]})

df["Output"] = "DTI"  # Use one of the cases as default
# Each comparison needs its own parentheses because & binds more tightly than ==
df.loc[(df["Before"] == "Injection") & (df["After"] == "Production"), "Output"] = "DTWF + DTP"
df.loc[(df["Before"] == "Production") & (df["After"] == "Injection"), "Output"] = "DTWF + DTI"
df.loc[(df["Before"] == "Production") & (df["After"] == "Production"), "Output"] = "DTP"
print(df)
#         After      Before      Output
# 0   Injection   Injection         DTI
# 1  Production   Injection  DTWF + DTP
# 2   Injection  Production  DTWF + DTI
# 3  Production  Production         DTP
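
For reference, the nested np.where attempt from the question can probably be made to work by wrapping each comparison in parentheses and always passing both the "true" and "false" arguments. The sketch below assumes numeric DTI, DTP, and DTWF columns; the sample values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the DTI/DTP/DTWF values are invented for illustration.
df = pd.DataFrame({
    "Before": ["Injection", "Injection", "Production", "Production"],
    "After": ["Injection", "Production", "Injection", "Production"],
    "DTI": [1.0, 2.0, 3.0, 4.0],
    "DTP": [10.0, 20.0, 30.0, 40.0],
    "DTWF": [100.0, 200.0, 300.0, 400.0],
})

# Parenthesize every comparison: without the parentheses, Python evaluates
# 'Injection' & df['After'] first, which is what triggers the errors.
df["Output"] = np.where(
    (df["Before"] == "Injection") & (df["After"] == "Injection"),
    df["DTI"],
    np.where(
        (df["Before"] == "Injection") & (df["After"] == "Production"),
        df["DTWF"] + df["DTP"],
        np.where(
            (df["Before"] == "Production") & (df["After"] == "Injection"),
            df["DTWF"] + df["DTI"],
            df["DTP"],  # fallback: Production -> Production
        ),
    ),
)
```

The first ValueError in the question comes from calling np.where with a condition and only one of the two value arguments; every level of the nesting here supplies both.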

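If you have many such combinations, using apply as suggested in the other answer may be more appropriate.

If you have many rows, it may make sense to save the boolean masks (e.g. df["Before"] == "Production") in variables: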

before_prod = df["Before"] == "Production"
after_prod = df["After"] == "Production"
df.loc[before_prod & after_prod, "Output"] = "DTP"
...

If you also have only these two states, you can use the unary negation operator ~ to get the second mask (almost) for free:

df.loc[before_prod & ~after_prod, "Output"] = "DTWF + DTI"
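
Putting the saved masks and ~ together, one possible vectorized way to cover all four cases and pull the actual column values is np.select, which is swapped in here as an alternative to repeated df.loc assignments. This is only a sketch: it assumes Before and After contain nothing but "Injection" and "Production", and the DTI, DTP, and DTWF values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the DTI/DTP/DTWF values are invented for illustration.
df = pd.DataFrame({
    "Before": ["Injection", "Injection", "Production", "Production"],
    "After": ["Injection", "Production", "Injection", "Production"],
    "DTI": [1.0, 2.0, 3.0, 4.0],
    "DTP": [10.0, 20.0, 30.0, 40.0],
    "DTWF": [100.0, 200.0, 300.0, 400.0],
})

# Compute each boolean mask once; ~mask stands in for "Injection"
# under the assumption that only two states exist.
before_prod = df["Before"] == "Production"
after_prod = df["After"] == "Production"

conditions = [
    ~before_prod & ~after_prod,  # Injection  -> Injection
    ~before_prod & after_prod,   # Injection  -> Production
    before_prod & ~after_prod,   # Production -> Injection
    before_prod & after_prod,    # Production -> Production
]
choices = [
    df["DTI"],
    df["DTWF"] + df["DTP"],
    df["DTWF"] + df["DTI"],
    df["DTP"],
]

# Rows matching none of the conditions would get NaN
df["Output"] = np.select(conditions, choices, default=np.nan)
```

If a third state ever appears in the data, the ~ shortcut would silently treat it as "Injection", so explicit equality checks would be the safer choice in that case.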