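How to filter a DataFrame and create a new column with Pandas

Time: 2021-05-01 18:25:53

Tags: python pandas dataframe conditional-statements

I am trying to filter on 3 columns of a DataFrame, apply a condition across them, and return a binary value: 1 if all conditions are met, 0 otherwise. An example is shown below.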

import pandas as pd
from numpy import array

data = {'PassengerId': array([2255, 2257, 2258, 2256, 2257, 2258, 2255, 2258, 2257, 2257, 2255,
        2255, 2257, 2256, 2257, 2256, 2255, 2258, 2258, 2256, 2256, 2257,
        2258, 2258, 2257]),
 'Pclass': array([3, 2, 2, 2, 4, 3, 3, 4, 3, 1, 1, 1, 1, 2, 4, 3, 1, 2, 4, 3, 2, 3,
        1, 1, 2]),
 'Age': array([40, 33, 32, 40, 48, 24, 33, 29, 29, 31, 45, 47, 28, 32, 54, 39, 28,
        50, 40, 31, 51, 26, 41, 46, 27]),
 'SibSp': array([11, 13, 12, 19, 22, 17, 23, 12, 12, 12, 12, 24, 16, 21, 12, 15, 20,
        18, 10, 17, 20, 12, 17, 17, 10]),
 'Comf' : array([236.66883531, 235.46750709, 235.64574546, 241.16838089,
        239.40728836, 239.95592634, 236.67806901, 237.73350635,
        238.74497849, 235.17486552, 235.8457374 , 236.85133744,
        240.9359547 , 236.27703374, 237.81871052, 241.62788018,
        241.29185342, 235.0058136 , 240.69989317, 238.8073828 ,
        238.08841364, 236.55259788, 237.58108419, 239.66916186,
        241.97479544]),
 'Parch': array([232.37686437, 232.39153096, 230.56566556, 232.77980061,
        232.19436342, 232.2165835 , 232.28145641, 231.26988217,
        230.55287196, 232.26528521, 230.45185855, 230.87525326,
        231.38775744, 232.80960083, 232.33105822, 232.65782351,
        231.64457366, 230.45225829, 231.05404057, 232.38229998,
        232.57354117, 232.08690375, 230.40414215, 230.14361969,
        231.40414745]),
 'Fare': array([238.80427104, 239.32031287, 238.02212358, 238.40333494,
        238.85929097, 239.51666683, 239.87771029, 238.06772515,
        238.22734658, 238.54682118, 238.68880278, 239.79658425,
        238.2642908 , 239.22884058, 239.84423352, 239.69438831,
        238.85871719, 238.64632848, 238.7085097 , 239.5700877 ,
        239.06199698, 238.37341378, 239.16126748, 239.01280153,
        239.77047796])}

df = pd.DataFrame(data)

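For the first case, I want to check whether "Pclass" == 1 and "Comf" lies between "Parch" and "Fare"; if so, create a new column "Survived" and assign 1, otherwise assign 0.

Then do the same for "Pclass" == 2, 3, ...

I would like to do this with pandas, but any approach that solves the problem is welcome.

3 answers:

Answer 0: (score: 0)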

Use assign: just compute the condition and convert it to int:

df = pd.DataFrame(data=data)

df = df.assign(Survived=lambda x: x['Comf'].between(x['Parch'], x['Fare']).astype(int))

print(df.to_string())

Or with plain assignment (=):

df = pd.DataFrame(data=data)

df['Survived'] = df['Comf'].between(df['Parch'], df['Fare']).astype(int)

print(df.to_string())

Output:

    PassengerId  Pclass  Age  SibSp        Comf       Parch        Fare  Survived
0          2255       3   40     11  236.668835  232.376864  238.804271         1
1          2257       2   33     13  235.467507  232.391531  239.320313         1
2          2258       2   32     12  235.645745  230.565666  238.022124         1
3          2256       2   40     19  241.168381  232.779801  238.403335         0
4          2257       4   48     22  239.407288  232.194363  238.859291         0
5          2258       3   24     17  239.955926  232.216584  239.516667         0
6          2255       3   33     23  236.678069  232.281456  239.877710         1
7          2258       4   29     12  237.733506  231.269882  238.067725         1
8          2257       3   29     12  238.744978  230.552872  238.227347         0
9          2257       1   31     12  235.174866  232.265285  238.546821         1
10         2255       1   45     12  235.845737  230.451859  238.688803         1
11         2255       1   47     24  236.851337  230.875253  239.796584         1
12         2257       1   28     16  240.935955  231.387757  238.264291         0
13         2256       2   32     21  236.277034  232.809601  239.228841         1
14         2257       4   54     12  237.818711  232.331058  239.844234         1
15         2256       3   39     15  241.627880  232.657824  239.694388         0
16         2255       1   28     20  241.291853  231.644574  238.858717         0
17         2258       2   50     18  235.005814  230.452258  238.646328         1
18         2258       4   40     10  240.699893  231.054041  238.708510         0
19         2256       3   31     17  238.807383  232.382300  239.570088         1
20         2256       2   51     20  238.088414  232.573541  239.061997         1
21         2257       3   26     12  236.552598  232.086904  238.373414         1
22         2258       1   41     17  237.581084  230.404142  239.161267         1
23         2258       1   46     17  239.669162  230.143620  239.012802         0
24         2257       2   27     10  241.974795  231.404147  239.770478         0
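For reference, `Series.between` includes both endpoints by default, so the expression above behaves like two chained comparisons. A minimal sketch on a small illustrative frame (the `small` frame and its rounded values are assumptions made for this example; they are loosely based on rows 9, 12 and 3 of the question's data):

```python
import pandas as pd

# Small illustrative frame; values are rounded from the question's data.
small = pd.DataFrame({
    "Comf":  [235.17, 240.94, 241.17],
    "Parch": [232.27, 231.39, 232.78],
    "Fare":  [238.55, 238.26, 238.40],
})

# Row-wise range check using Series as the lower and upper bounds.
via_between = small["Comf"].between(small["Parch"], small["Fare"])

# The same check written as two explicit, inclusive comparisons.
via_compare = (small["Comf"] >= small["Parch"]) & (small["Comf"] <= small["Fare"])

assert via_between.equals(via_compare)

survived = via_between.astype(int)
print(survived.tolist())  # [1, 0, 0]
```

Note that this check does not look at Pclass at all; it flags every row whose Comf value lies in the range.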

Answer 1: (score: 0)

If you want to do this for all rows regardless of the Pclass value, you can use

df["Survived"] = df["Comf"].between(df["Parch"], df["Fare"]).astype(int)

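But if you want to do it only for a specific Pclass, you can use the following (note the parentheses around each condition):

df["Survived"] = ((df["Pclass"] == 1) & df["Comf"].between(df["Parch"], df["Fare"])).astype(int)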
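In Python, `&` binds more tightly than `==`, so each comparison needs its own parentheses before the masks are combined. The sketch below shows one possible way to repeat the check for several classes; the `demo` frame, the `flag_class` helper and the `Survived_1` / `Survived_2` column names are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical four-row frame, not the question's data.
demo = pd.DataFrame({
    "Pclass": [1, 1, 2, 3],
    "Comf":   [5.0, 9.0, 5.0, 5.0],
    "Parch":  [1.0, 1.0, 1.0, 6.0],
    "Fare":   [8.0, 8.0, 8.0, 8.0],
})

# Range check shared by all classes.
in_range = demo["Comf"].between(demo["Parch"], demo["Fare"])

def flag_class(frame, pclass, mask):
    """Return 1 where the row belongs to `pclass` and `mask` is True, else 0."""
    # Parentheses around the comparison are required because & binds tighter than ==.
    return ((frame["Pclass"] == pclass) & mask).astype(int)

demo["Survived_1"] = flag_class(demo, 1, in_range)
demo["Survived_2"] = flag_class(demo, 2, in_range)
print(demo[["Pclass", "Survived_1", "Survived_2"]])
```

Whether each class should get its own column or share a single "Survived" column depends on what the question intends, which the original post leaves open.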

Answer 2: (score: 0)

Try this.

Steps:

  1. Get the indexes of the rows that satisfy your condition.

indexesOfTrue = df[(df["Pclass"]==1) & (df["Comf"] > df["Parch"]) & (df["Comf"] < df["Fare"])].index

  2. Use loc to fill those indexes.

df.loc[indexesOfTrue, "Survived"] = 1

  3. Fill the remaining indexes, where the condition is not true.

df.loc[~df.index.isin(indexesOfTrue), "Survived"] = 0

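Output (excerpt; rows that do not match are shown as 2 in this listing, whereas the code above assigns 0):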

    PassengerId  Pclass  Age  SibSp        Comf       Parch        Fare  Survived
5          2258       3   24     17  239.955926  232.216584  239.516667         2
6          2255       3   33     23  236.678069  232.281456  239.877710         2
7          2258       4   29     12  237.733506  231.269882  238.067725         2
8          2257       3   29     12  238.744978  230.552872  238.227347         2
9          2257       1   31     12  235.174866  232.265285  238.546821         1
10         2255       1   45     12  235.845737  230.451859  238.688803         1
11         2255       1   47     24  236.851337  230.875253  239.796584         1
12         2257       1   28     16  240.935955  231.387757  238.264291         2
13         2256       2   32     21  236.277034  232.809601  239.228841         2
14         2257       4   54     12  237.818711  232.331058  239.844234         2
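The three steps above can also be collapsed into a single assignment with `numpy.where`. The following is a sketch under stated assumptions rather than the answer's own method: the `demo` frame is hypothetical, and the strict `>` and `<` comparisons are kept as in the answer, which excludes the endpoints (unlike `between`, which includes them by default).

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration only.
demo = pd.DataFrame({
    "Pclass": [1, 1, 2],
    "Comf":   [5.0, 9.0, 5.0],
    "Parch":  [1.0, 1.0, 1.0],
    "Fare":   [8.0, 8.0, 8.0],
})

# Combined condition; each comparison is parenthesized before applying &.
cond = (demo["Pclass"] == 1) & (demo["Comf"] > demo["Parch"]) & (demo["Comf"] < demo["Fare"])

# Three-step version: look up the matching index, then fill with loc.
stepwise = demo.copy()
idx = stepwise[cond].index
stepwise.loc[idx, "Survived"] = 1
stepwise.loc[~stepwise.index.isin(idx), "Survived"] = 0

# Single-step version: 1 where the condition holds, 0 elsewhere.
demo["Survived"] = np.where(cond, 1, 0)

assert (stepwise["Survived"] == demo["Survived"]).all()
print(demo["Survived"].tolist())  # [1, 0, 0]
```

The step-wise route typically leaves a float column, because the column briefly holds missing values before the second fill, while `np.where` produces integers directly.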