I have the following DataFrame:
Variable Value Classification
Variable_1 18
Variable_1 25
Variable_1 16
Variable_1 34
Variable_2 37
Variable_2 22
Variable_2 14
Variable_2 26
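I want to fill in the empty Classification column of the table above by comparing each value against the intervals/ranges defined in the table below.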
Variable Classif from to
Variable_1 A 17 24
Variable_1 B 25 30
Variable_1 C 31 35
Variable_2 A 10 19
Variable_2 B 20 25
Variable_2 C 26 50
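The first table is only a sample of the actual DataFrame (the original has more than 20,000 rows).
Can anyone recommend an efficient way to do this? Thanks in advance.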
Answer 0 (score: 1)
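As noted above, there are some problems with the conditions, because only two values satisfy them. I added a Condition Met? column to help you visualize this; afterwards you can either drop that column or keep only the rows where it is True.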
In the code below, df is the first DataFrame from your question and df1 is the second:
import numpy as np
import pandas as pd

df2 = pd.merge(df,df1,how='left',on='Variable')
df2['Condition Met?'] = df2['Value'].between(df2['from'], df2['to'])
df2 = df2.sort_values(['Variable', 'Value', 'Condition Met?']).drop_duplicates(['Variable', 'Value'], keep='last')
# df2 = df2[df2['Condition Met?']].drop('Condition Met?', axis=1)
df2
Out[1]:
Variable Value Classif from to Condition Met?
0 Variable_1 18 A 17 24 True
11 Variable_1 37 C 31 35 False
8 Variable_1 54 C 31 35 False
5 Variable_1 65 C 31 35 False
16 Variable_2 22 B 20 25 True
14 Variable_2 37 C 26 50 True
23 Variable_2 66 C 26 50 False
20 Variable_2 78 C 26 50 False
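For reference, here is a minimal, self-contained sketch of the same merge-and-flag step, run on the sample tables exactly as they appear in the question. The names `merged` and `flagged` are only illustrative, and because the values in Out[1] differ from the question's sample, the printed result will not match Out[1] row for row.

```python
import pandas as pd

# Sample data copied from the two tables in the question.
df = pd.DataFrame({
    "Variable": ["Variable_1"] * 4 + ["Variable_2"] * 4,
    "Value": [18, 25, 16, 34, 37, 22, 14, 26],
})
df1 = pd.DataFrame({
    "Variable": ["Variable_1"] * 3 + ["Variable_2"] * 3,
    "Classif": ["A", "B", "C"] * 2,
    "from": [17, 25, 31, 10, 20, 26],
    "to": [24, 30, 35, 19, 25, 50],
})

# Pair every value with every range defined for its variable (8 x 3 = 24 rows).
merged = pd.merge(df, df1, how="left", on="Variable")

# Series.between is inclusive on both ends by default.
merged["Condition Met?"] = merged["Value"].between(merged["from"], merged["to"])

# Keep one row per (Variable, Value). False sorts before True, so with
# keep="last" a matching range wins whenever one exists.
flagged = (
    merged.sort_values(["Variable", "Value", "Condition Met?"])
    .drop_duplicates(["Variable", "Value"], keep="last")
)
print(flagged)
```

The left merge grows the frame by the number of ranges per variable, which should stay manageable for roughly 20,000 rows and a handful of ranges each.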
After dropping the rows where Condition Met? is False, along with the column itself:
df2 = pd.merge(df,df1,how='left',on='Variable')
df2['Condition Met?'] = df2['Value'].between(df2['from'], df2['to'])
df2 = df2.sort_values(['Variable', 'Value', 'Condition Met?']).drop_duplicates(['Variable', 'Value'], keep='last')
df2 = df2[df2['Condition Met?']].drop('Condition Met?', axis=1)
df2
Out[2]:
Variable Value Classif from to
0 Variable_1 18 A 17 24
16 Variable_2 22 B 20 25
14 Variable_2 37 C 26 50
Alternatively, you can return NaN in the Classif column when the condition is not met:
df2 = pd.merge(df,df1,how='left',on='Variable')
df2['Condition Met?'] = df2['Value'].between(df2['from'], df2['to'])
df2 = df2.sort_values(['Variable', 'Value', 'Condition Met?']).drop_duplicates(['Variable', 'Value'], keep='last')
df2['Classif'] = df2['Classif'].where(df2['Condition Met?'],np.nan)
df2 = df2.drop('Condition Met?', axis=1)
df2
Out[3]:
Variable Value Classif from to
0 Variable_1 18 A 17 24
11 Variable_1 37 NaN 31 35
8 Variable_1 54 NaN 31 35
5 Variable_1 65 NaN 31 35
16 Variable_2 22 B 20 25
14 Variable_2 37 C 26 50
23 Variable_2 66 NaN 26 50
20 Variable_2 78 NaN 26 50
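One caveat: drop_duplicates(['Variable', 'Value']) also collapses rows of the original frame that happen to share the same variable and value, which is likely in a 20,000-row frame. The following is a sketch of one possible way around that, not part of the answer above: it tags each original row with a positional id before merging and writes the label back into a Classification column. The helper name `classify` and the `row_id` column are hypothetical, and the sample data is copied from the question.

```python
import pandas as pd

def classify(values: pd.DataFrame, ranges: pd.DataFrame) -> pd.Series:
    """Return one label per row of `values`; NaN where no range matches."""
    tmp = values[["Variable", "Value"]].copy()
    tmp["row_id"] = range(len(tmp))  # positional id that survives the merge
    merged = tmp.merge(ranges, how="left", on="Variable")
    # Keep only the (row, range) pairs where the value falls inside the range.
    hit = merged[merged["Value"].between(merged["from"], merged["to"])]
    # If ranges overlap, keep the first matching range for each row.
    labels = hit.drop_duplicates("row_id").set_index("row_id")["Classif"]
    # Rows without a match are absent from `labels`; reindex fills them with NaN.
    out = labels.reindex(range(len(tmp)))
    out.index = values.index
    return out

# Sample data copied from the two tables in the question.
df = pd.DataFrame({
    "Variable": ["Variable_1"] * 4 + ["Variable_2"] * 4,
    "Value": [18, 25, 16, 34, 37, 22, 14, 26],
})
df1 = pd.DataFrame({
    "Variable": ["Variable_1"] * 3 + ["Variable_2"] * 3,
    "Classif": ["A", "B", "C"] * 2,
    "from": [17, 25, 31, 10, 20, 26],
    "to": [24, 30, 35, 19, 25, 50],
})

df["Classification"] = classify(df, df1)
print(df)
```

On the question's sample, only the value 16 for Variable_1 falls outside every range, so it is the one row left as NaN. The result keeps the original row order and row count.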