Question

My data looks like this, and I am trying to create a column named output with the values given below.

      a_id b_received c_consumed
  0    sam       soap        oil
  1    sam        oil        NaN
  2    sam      brush       soap
  3  harry        oil      shoes
  4  harry      shoes        oil
  5  alice       beer       eggs
  6  alice      brush      brush
  7  alice       eggs        NaN

The code to generate the dataset is

import pandas as pd

# note: 'NaN' here is the literal string 'NaN', not a missing value
df = pd.DataFrame({'a_id': 'sam sam sam harry harry alice alice alice'.split(),
                   'b_received': 'soap oil brush oil shoes beer brush eggs'.split(),
                   'c_consumed': 'oil NaN soap shoes oil eggs brush NaN'.split()})

I want a new column named output that looks like this

      a_id b_received c_consumed   output
  0    sam       soap        oil   1
  1    sam        oil        NaN   1
  2    sam      brush       soap   0
  3  harry        oil      shoes   1
  4  harry      shoes        oil   1
  5  alice       beer       eggs   0
  6  alice      brush      brush   1 
  7  alice       eggs        NaN   1

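The logic is this: sam received soap, oil and brush, so look for each of those values in the 'consumed' column among the products he consumed. Soap was consumed, so its output is 1, but brush was not consumed, so its output is 0.

Similarly, harry received oil and shoes, so look for oil and shoes in his consumed column. Oil was consumed, so its output is 1.

To make it clearer, each output value belongs to the value in the first column (received) and depends on which values are present in the second column (consumed).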

I tried using this code

   a=[]
   for i in range(len(df.b_received)):
         if any(df.c_consumed == df.b_received[i] ):
              a.append(1)
         else:
              a.append(0)

   df['output']=a

which gives me the output

       a_id b_received c_consumed  output
  0    sam       soap        oil       1
  1    sam        oil        NaN       1
  2    sam      brush       soap       1
  3  harry        oil      shoes       1
  4  harry      shoes        oil       1
  5  alice       beer       eggs       0
  6  alice      brush      brush       1
  7  alice       eggs        NaN       1

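The problem is that sam did not consume brush, so the output for that row should be 0, but it is 1 because brush was consumed by another person (alice). I need to make sure that doesn't happen. The output has to be specific to each person's consumption.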

I know this is confusing, so if I have not made myself clear, please ask and I will respond to your comments.

Answer 1

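The key is pandas.Series.isin(), which checks whether each element of the calling pandas.Series is a member of the object passed to isin(). You want to check each element of b_received for membership in c_consumed, but only within each group defined by a_id. When you use groupby with apply, pandas indexes the result by the grouping variable as well as the original index. In your case you don't need the grouping variable in the index, so you can use reset_index with drop=True to restore the original index.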
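
As a small illustration of the membership check on its own, before any grouping, here is a sketch that uses only sam's three rows. The variable names received, consumed and flags are just for this example and are not part of the question's code.

```python
import pandas as pd

# sam's rows only; the names below are illustrative
received = pd.Series(['soap', 'oil', 'brush'])
consumed = pd.Series(['oil', 'NaN', 'soap'])

# isin() tests each element of `received` for membership in `consumed`
mask = received.isin(consumed)

# cast the booleans to 32-bit integers, as the answer does with 'i4'
flags = mask.astype('i4')
print(flags.tolist())  # [1, 1, 0]
```

Applying the same check once per a_id group is what keeps alice's brush from counting for sam.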

df['output'] = (df.groupby('a_id')
               .apply(lambda x : x['b_received'].isin(x['c_consumed']).astype('i4'))
               .reset_index(level='a_id', drop=True))

Your DataFrame is now ...

    a_id b_received c_consumed  output
0    sam       soap        oil       1
1    sam        oil        NaN       1
2    sam      brush       soap       0
3  harry        oil      shoes       1
4  harry      shoes        oil       1
5  alice       beer       eggs       0
6  alice      brush      brush       1
7  alice       eggs        NaN       1

Have a look at the split-apply-combine documentation for pandas for a more thorough explanation.
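
If you would rather avoid groupby and apply altogether, one alternative (a sketch of a different technique, not the method above) is to compare (person, item) pairs directly. The names consumed_pairs and received_pairs are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'a_id': 'sam sam sam harry harry alice alice alice'.split(),
                   'b_received': 'soap oil brush oil shoes beer brush eggs'.split(),
                   'c_consumed': 'oil NaN soap shoes oil eggs brush NaN'.split()})

# every (person, item) pair that appears in the consumed column
consumed_pairs = set(zip(df['a_id'], df['c_consumed']))

# a received item counts only if the same person consumed it
received_pairs = zip(df['a_id'], df['b_received'])
df['output'] = [int(pair in consumed_pairs) for pair in received_pairs]

print(df['output'].tolist())  # [1, 1, 0, 1, 1, 0, 1, 1]
```

Because the person is part of each pair, ('sam', 'brush') is not matched by ('alice', 'brush'), which is exactly the per-person restriction the question asks for.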

Answer 2

This should work, although the ideal approach is the one given by JaminSore

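df['output'] = 0

for name in df['a_id'].unique():
    # rows belonging to this person, and the items that person consumed
    group = df.loc[df.a_id == name]
    consumed = group['c_consumed'].values
    for n, row in group.iterrows():
        # n is the row's index label; assign with .loc
        # (df.ix was removed in pandas 1.0, and chained assignment is unreliable)
        if row['b_received'] in consumed:
            df.loc[n, 'output'] = 1
        else:
            df.loc[n, 'output'] = 0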

The DataFrame is now

    a_id b_received c_consumed  output
0    sam       soap        oil       1
1    sam        oil        NaN       1
2    sam      brush       soap       0
3  harry        oil      shoes       1
4  harry      shoes        oil       1
5  alice       beer       eggs       0
6  alice      brush      brush       1
7  alice       eggs        NaN       1
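
The same per-person loop can also be written by iterating over the groups that groupby yields, which removes the inner row loop. This is only a sketch of that variation; the loop variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a_id': 'sam sam sam harry harry alice alice alice'.split(),
                   'b_received': 'soap oil brush oil shoes beer brush eggs'.split(),
                   'c_consumed': 'oil NaN soap shoes oil eggs brush NaN'.split()})

df['output'] = 0

# groupby yields (key, sub-DataFrame) pairs; each sub-DataFrame keeps
# the original index labels of its rows
for name, group in df.groupby('a_id'):
    hits = group['b_received'].isin(group['c_consumed']).astype(int)
    # assign back by index label, so the order of the groups does not matter
    df.loc[group.index, 'output'] = hits

print(df['output'].tolist())  # [1, 1, 0, 1, 1, 0, 1, 1]
```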

Compare values in 2 columns and output the result to a third column in pandas
