Creating new columns based on multiple conditions in Python

Time: 2016-06-23 17:46:40

Tags: python pandas select dataframe

I have a data frame; the code that builds it is shown further down in this question.

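The first column contains the user_id, and each row represents one action performed by that user. Each user_id appears in either the "Actor1" or the "Actor2" column of its row.

First, I want to create a new column that is assigned the value 1 if the user_id is found in the "Actor1" column, and 0 otherwise.

Second, I want to create a new column that stores, for each user_id, the value of the "Actor" column it interacted with, that is, the other actor in the row.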

The example data frame is built as follows; the desired output adds the two new columns to it:

import pandas as pd

data = [
    (27450, 27450, 29420, "10/10/2016"),
    (29420, 36142, 29420, "10/10/2016"),
    (11, 11, 27450, "10/10/2016"),
]

# Create the base DataFrame
df = pd.DataFrame(data, columns=("User_id", "Actor1", "Actor2", "Time"))

What is the most efficient, Pythonic way to do this?

Many thanks in advance!

2 answers:

Answer 0: (score: 2)

import numpy as np
import pandas as pd

data = [(27450, 27450, 29420,"10/10/2016"),
        (29420 , 36142, 29420, "10/10/2016"),
        (11 , 11, 27450, "10/10/2016")] 
df = pd.DataFrame(data, columns=("User_id","Actor1","Actor2", "Time"))
mask = (df['User_id'] == df['Actor1'])
df['first actor'] = mask.astype(int)
df['other actor'] = np.where(mask, df['Actor2'], df['Actor1'])
print(df)

yields

   User_id  Actor1  Actor2        Time  first actor  other actor
0    27450   27450   29420  10/10/2016            1        29420
1    29420   36142   29420  10/10/2016            0        36142
2       11      11   27450  10/10/2016            1        27450

First, create a boolean mask that is True wherever User_id equals Actor1 in the same row:

In [51]: mask = (df['User_id'] == df['Actor1']); mask
Out[51]: 
0     True
1    False
2     True
dtype: bool

Converting mask to ints creates the first column:

In [52]: mask.astype(int)
Out[52]: 
0    1
1    0
2    1
dtype: int64

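Then use np.where to select between two values. np.where(mask, A, B) returns an array whose ith value is A[i] if mask[i] is True, and B[i] otherwise. Thus, np.where(mask, df['Actor2'], df['Actor1']) takes the value from Actor2 where mask is True, and the value from Actor1 otherwise: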

In [53]: np.where(mask, df['Actor2'], df['Actor1'])
Out[53]: array([29420, 36142, 27450])
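
To make the selection rule concrete, here is a minimal sketch of np.where on three small arrays. The names a, b and picked, and the values in them, are made up for illustration and are not part of the question's data.

```python
import numpy as np

# Illustrative arrays only; not taken from the question's data frame.
mask = np.array([True, False, True])
a = np.array([10, 20, 30])   # used where mask is True
b = np.array([-1, -2, -3])   # used where mask is False

# Element i comes from a[i] when mask[i] is True, otherwise from b[i].
picked = np.where(mask, a, b)
print(picked.tolist())  # [10, -2, 30]
```

The result is a plain NumPy array, so assigning it to a data frame column relies on it having the same length and row order as the frame, which holds when mask, A and B all come from the same data frame.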

Answer 1: (score: 0)

Here is my solution. I am assuming that if the user id appears in the Actor1 column, then it will not be in the same row...

df["Col1"] = [1 if i in df["Actor1"].values else 0 for i in df["User_id"].values]
df["Col2"] = [df.iloc[i]["Actor2"] if j == 1 else df.iloc[i]["Actor1"] for i, j in enumerate(df["Col1"].values)]

Output:

   User_id  Actor1  Actor2        Time  Col1   Col2
0    27450   27450   29420  10/10/2016     1  29420
1    29420   36142   29420  10/10/2016     0  36142
2       11      11   27450  10/10/2016     1  27450
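
Note that the list comprehension for Col1 tests whether each user id occurs anywhere in the Actor1 column, not whether it equals Actor1 in its own row. The two readings agree on the example data, but they can differ on other inputs. The sketch below uses a small made-up data frame (hypothetical values, not from the question) in which the same user is Actor2 in one row and Actor1 in another.

```python
import pandas as pd

# Hypothetical data: user 29420 is Actor2 in row 0 and Actor1 in row 1.
df = pd.DataFrame(
    [(29420, 36142, 29420), (29420, 29420, 11)],
    columns=["User_id", "Actor1", "Actor2"],
)

# Column-wide membership: does the user id occur anywhere in Actor1?
anywhere = df["User_id"].isin(df["Actor1"]).astype(int)

# Row-wise equality: is the user id the Actor1 of this same row?
same_row = (df["User_id"] == df["Actor1"]).astype(int)

print(anywhere.tolist())  # [1, 1]
print(same_row.tolist())  # [0, 1]
```

If the intent is "the user is Actor1 in this row", the row-wise comparison is probably the safer choice; which reading the question wants is an assumption here.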