I am trying to create a flag variable (i.e. a new column containing binary values, such as 1 for True and 0 for False). I have tried np.where and df.where, to no avail.
Using df.where:
df.where(((df['MOSL_Rating'] == 'Highly Effective') & (df['MOTP_Rating'] == 'Developing')) | ((df['MOSL_Rating'] == 'Highly Effective') & (df['MOTP_Rating'] == 'Ineffective')) | ((df['MOSL_Rating'] == 'Effective') & (df['MOTP_Rating'] == 'Ineffective')) | ((df['MOSL_Rating'] == 'Ineffective') & (df['MOTP_Rating'] == 'Highly Effective')) | ((df['MOSL_Rating'] == 'Ineffective') & (df['MOTP_Rating'] == 'Effective')) | ((df['MOSL_Rating'] == 'Developing') & (df['MOTP_Rating'] == 'Highly Effective')), df['disp_rating'], 1, axis=1)
But this returns ValueError: For argument "inplace" expected type bool, received type int.
If I change the code from df['disp_rating'], 1, axis=1 to df['disp_rating'], True, axis=1, it returns TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value
I also tried np.where, but it returns ValueError: either both or neither of x and y should be given
I also read this post, which looks similar. However, when I use the solution provided there, it returns:
KeyError: 'disp_rating'
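If I create the variable ahead of time (to avoid the KeyError), I just get other errors about other things.
I thought creating a new variable based on some basic conditions would be really simple, but I have been stuck on this for a while, and despite reading the documentation and many SO posts I have not made any progress.
Edit: To be clearer, I am trying to create a new column (named 'disp_rating') based on whether the values in 2 other columns ('MOSL_Rating' and 'MOTP_Rating') within the same df meet certain conditions. I only have 1 dataframe, so I am not trying to compare 2 dataframes. In SQL I would use a CASE WHEN statement, and in SAS I would use an IF / THEN / ELSE statement.
My df typically looks like this: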
ID  Loc  MOSL_Rating  MOTP_Rating
12  54X  D            E
45  86I  D            I
98  65R  H            H
Answer 0 (score: 0)
I could not find out why where does not work, but here is one way to do it.
First, some code to create your df:
import random
import pandas as pd

def make_row():
    # Possible values for each column; one is picked at random per row.
    dico = {"MOSL_Rating": ['Highly Effective', 'Effective', 'Ineffective', 'Developing'],
            "MOTP_Rating": ['Developing', 'Ineffective', 'Highly Effective', 'Effective', 'Highly Effective'],
            "disp_rating": range(100)}
    row = {}
    for k in dico.keys():
        v = random.choice(dico[k])
        row[k] = v
    return row

def make_df(nb_row):
    # Build a DataFrame made of nb_row random rows.
    rows = [make_row() for i in range(nb_row)]
    return pd.DataFrame(rows)
Now I can create a df:
df = make_df(3)
MOSL_Rating MOTP_Rating disp_rating
0 Highly Effective Ineffective 39
1 Highly Effective Highly Effective 71
2 Effective Ineffective 95
and a second one:
df2 = make_df(3)
df2
MOSL_Rating MOTP_Rating disp_rating
0 Effective Highly Effective 24
1 Effective Developing 38
2 Highly Effective Ineffective 16
Then I create your tests:
MOSL_high_efective = df['MOSL_Rating'] == 'Highly Effective'
MOSL_efective = df['MOSL_Rating'] == 'Effective'
MOSL_inefective = df['MOSL_Rating'] == 'Ineffective'
MOSL_developing = df['MOSL_Rating'] == 'Developing'
MOTP_high_efective = df['MOTP_Rating'] == 'Highly Effective'
MOTP_efective = df['MOTP_Rating'] == 'Effective'
MOTP_inefective = df['MOTP_Rating'] == 'Ineffective'
MOTP_developing = df['MOTP_Rating'] == 'Developing'
test1 = MOSL_high_efective & MOTP_developing
test2 = MOSL_high_efective & MOTP_inefective
test3 = MOSL_efective & MOTP_inefective
test4 = MOSL_inefective & MOTP_high_efective
test5 = MOSL_inefective & MOTP_efective
test6 = MOSL_developing & MOTP_high_efective
conditions = test1 | test2 | test3 | test4 | test5 | test6
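At this point the combined mask is already a boolean Series, so the 0/1 flag the question asks for can be derived from it directly. The following is a minimal sketch under assumptions: the sample rows and the column name `flag` are made up for illustration, and the mask is built more compactly with `isin`, although it encodes the same six combinations as the tests above.

```python
import pandas as pd

# Hypothetical sample rows using the rating labels from the question.
df = pd.DataFrame({
    "MOSL_Rating": ["Highly Effective", "Highly Effective", "Effective", "Developing"],
    "MOTP_Rating": ["Ineffective", "Highly Effective", "Ineffective", "Effective"],
})

mosl = df["MOSL_Rating"]
motp = df["MOTP_Rating"]

# Same six combinations as test1..test6, grouped by the MOSL value.
conditions = (
    ((mosl == "Highly Effective") & motp.isin(["Developing", "Ineffective"]))
    | ((mosl == "Effective") & (motp == "Ineffective"))
    | ((mosl == "Ineffective") & motp.isin(["Highly Effective", "Effective"]))
    | ((mosl == "Developing") & (motp == "Highly Effective"))
)

# Casting the boolean mask to int yields 1 for True and 0 for False.
# "flag" is an illustrative column name, not one from the original post.
df["flag"] = conditions.astype(int)
```

Because the new column is created by plain assignment, it does not need to exist beforehand.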
Then update the values of the first dataframe with those of the second dataframe on the rows where the conditions are met:
lines_to_be_updates = df.loc[conditions].index.values
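# Select rows by label with .loc; df2[lines_to_be_updates] would look the labels up as columns.
df.loc[lines_to_be_updates, "disp_rating"] = df2.loc[lines_to_be_updates, "disp_rating"]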
df
MOSL_Rating MOTP_Rating disp_rating
0 Highly Effective Ineffective 24
1 Highly Effective Highly Effective 71
2 Effective Ineffective 16
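As for the errors in the question, a likely explanation (an assumption based on the documented signatures, not something verified against the asker's data) is that `where` on a pandas object keeps the original values where the condition is True and substitutes `other` elsewhere, and that a third positional argument after `other` is interpreted as `inplace` in pandas versions that accept it positionally. `np.where`, by contrast, needs both the "true" and the "false" value. The small sketch below uses made-up Series to show the difference:

```python
import numpy as np
import pandas as pd

# Made-up values purely for illustration.
s = pd.Series([10, 20, 30])
cond = pd.Series([True, False, True])

# Series.where keeps the original value where cond is True
# and substitutes `other` where cond is False.
kept = s.where(cond, other=-1)

# np.where picks the second argument where cond is True
# and the third argument where it is False.
flag = np.where(cond, 1, 0)

# Supplying only one of the two values raises the ValueError
# quoted in the question.
try:
    np.where(cond, 1)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

For a 0/1 flag, `np.where(cond, 1, 0)` or `cond.astype(int)` is therefore usually a better fit than `df.where`.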
Answer 1 (score: 0)
Your logic is overly complicated and can be simplified/optimized with set. Here is a demo.
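import numpy as np

# Unordered rating pairs that should be flagged; frozenset ignores which column holds which value.
d = {frozenset({'H', 'D'}),
     frozenset({'H', 'I'}),
     frozenset({'E', 'I'})}
# Combine the two ratings of each row into a frozenset, then test membership in d.
df['MOSL_MOTP'] = list(map(frozenset, zip(df['MOSL_Rating'], df['MOTP_Rating'])))
df['Result'] = np.where(df['MOSL_MOTP'].isin(d), 1, 0)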
# ID Loc MOSL_Rating MOTP_Rating MOSL_MOTP Result
# 0 12 54X D E (E, D) 0
# 1 45 86I D I (D, I) 0
# 2 98 65R H H (H) 0
# 3 95 66R H D (D, H) 1
# 4 96 67R D H (D, H) 1
# 5 97 68R E I (E, I) 1
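Note that the frozenset approach treats each pair as unordered, which matches the question because every listed combination also appears with the two columns swapped. If the order ever mattered, plain tuples could be used instead. The sketch below assumes a hypothetical rule in which only the (MOSL, MOTP) orderings listed in `ordered_pairs` are flagged; that rule and the sample rows are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample rows mirroring the demo output above.
df = pd.DataFrame({
    "MOSL_Rating": ["D", "D", "H", "H", "D", "E"],
    "MOTP_Rating": ["E", "I", "H", "D", "H", "I"],
})

# Hypothetical ordered rule: (MOSL, MOTP) must match exactly, so ("D", "H") is NOT flagged.
ordered_pairs = {("H", "D"), ("H", "I"), ("E", "I")}

# Build one (MOSL, MOTP) tuple per row and test set membership.
mask = [pair in ordered_pairs for pair in zip(df["MOSL_Rating"], df["MOTP_Rating"])]
df["Result"] = np.where(mask, 1, 0)
```

With tuples, row 4 (D, H) is no longer flagged, whereas the frozenset version flags it.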