Question

My question is simple. I have the following table:

+----------+-------+------------+--------+
| industry | class | occupation | value  |
+----------+-------+------------+--------+
|      170 |     4 |       1000 |  123.3 |
|      180 |     7 |       3600 | 4543.8 |
|      570 |     5 |        990 |  657.4 |
+----------+-------+------------+--------+

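I want to create a new column called "type". Its values are based on the following conditions:

Class = 7: QWE
Class = 8: ASD
Class = 1 or 2: ZXC
Class = 4, 5 or 6 AND industry = 170-490 or 570-690 AND occupation >= 1000: IOP
Class = 4, 5 or 6 AND industry = 170-490 or 570-690 AND occupation between 10 and 3540: JKL
Otherwise: BNM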

The resulting table would look like this:

+----------+-------+------------+--------+------+
| industry | class | occupation | value  | type |
+----------+-------+------------+--------+------+
|      170 |     4 |       1000 |  123.3 | IOP  |
|      180 |     7 |       3600 | 4543.8 | QWE  |
|      570 |     5 |        990 |  657.4 | JKL  |
+----------+-------+------------+--------+------+

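My first approach was basically to create a separate dataframe for each type using the dataframe query method. Then I found out about numpy's "where" method, and I am currently using a nested version of it to create the "type" column in one step. However, I find the result hard to read, and I can imagine a situation with many more conditions where it would look very messy. Is there a cleaner way to do this? Maybe with a dictionary or something similar?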
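For concreteness, a nested np.where version of these rules might look roughly like the sketch below. The DataFrame construction and the helper names in_industry and mid_class are assumptions made for illustration; this is not the asker's actual code.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "industry": [170, 180, 570],
    "class": [4, 7, 5],
    "occupation": [1000, 3600, 990],
    "value": [123.3, 4543.8, 657.4],
})

# Hypothetical helper masks shared by the IOP and JKL rules.
in_industry = df["industry"].between(170, 490) | df["industry"].between(570, 690)
mid_class = df["class"].isin([4, 5, 6]) & in_industry

# Each np.where falls through to the next rule in its "else" branch.
df["type"] = np.where(
    df["class"] == 7, "QWE",
    np.where(
        df["class"] == 8, "ASD",
        np.where(
            df["class"].isin([1, 2]), "ZXC",
            np.where(
                mid_class & (df["occupation"] >= 1000), "IOP",
                np.where(
                    mid_class & df["occupation"].between(10, 3540), "JKL",
                    "BNM",
                ),
            ),
        ),
    ),
)
```

Every additional rule adds another level of nesting, which is the readability problem described above.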

Answer 1

Set up your conditions and outputs and store them in lists:

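import numpy as np

# Simple conditions on class (df is the DataFrame from the question)
a = df['class'].eq(7)
b = df['class'].eq(8)
c = df['class'].isin([1, 2])

# Shared part of the IOP/JKL rules; between() is inclusive on both ends
helper = df['class'].isin([4, 5, 6]) & (df.industry.between(170, 490) | df.industry.between(570, 690))
d = helper & df.occupation.ge(1000)
e = helper & df.occupation.between(10, 3540)  # overlaps with d for 1000-3540

# Conditions and their outputs, in matching order
conds = [a, b, c, d, e]
outs = ['QWE', 'ASD', 'ZXC', 'IOP', 'JKL']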

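Use np.select. Note that your conditions overlap, so a row can qualify for both IOP and JKL. np.select takes the first condition in the list that matches, so the order of the list decides which label such a row gets: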

df['out'] = np.select(conds, outs, default='BNM')

   industry  class  occupation   value  out
0       170      4        1000   123.3  IOP
1       180      7        3600  4543.8  QWE
2       570      5         990   657.4  JKL
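To make the overlap concrete, here is a minimal, self-contained sketch. The first row of the question's data (occupation 1000) satisfies both the IOP and the JKL rule, so its label depends on the order in which the conditions are passed to np.select. The dict named rules and the list named swapped are illustrative assumptions, not part of the answer above; keeping each label next to its condition in a dict is one way to get close to the "dictionary" idea raised in the question.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "industry": [170, 180, 570],
    "class": [4, 7, 5],
    "occupation": [1000, 3600, 990],
    "value": [123.3, 4543.8, 657.4],
})

# Shared part of the IOP and JKL rules.
helper = df["class"].isin([4, 5, 6]) & (
    df["industry"].between(170, 490) | df["industry"].between(570, 690)
)

# Label -> condition. Dicts keep insertion order, so earlier rules win.
rules = {
    "QWE": df["class"].eq(7),
    "ASD": df["class"].eq(8),
    "ZXC": df["class"].isin([1, 2]),
    "IOP": helper & df["occupation"].ge(1000),
    "JKL": helper & df["occupation"].between(10, 3540),
}

df["type"] = np.select(list(rules.values()), list(rules.keys()), default="BNM")

# Same rules, but with JKL checked before IOP.
swapped = ["QWE", "ASD", "ZXC", "JKL", "IOP"]
df["type_swapped"] = np.select([rules[k] for k in swapped], swapped, default="BNM")

print(df[["industry", "class", "occupation", "type", "type_swapped"]])
```

With the original order the first row is labelled IOP; with JKL listed first it becomes JKL. If the two rules are meant to be mutually exclusive, tightening one of the occupation ranges removes the dependence on order.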
