Merging different columns

Time: 2018-12-02 06:46:59

Tags: python pandas dataframe

I have a data frame like this:

import pandas as pd

df = pd.DataFrame({'id':[1,2,3,4,5,6,7],
                   'vote':[5,4,5,1,10,1,9],
                   'doggo': [None,"doggo",None,None,"doggo",None,None],
                   'floofer': ["floofer",None,None,"floofer",None,None,None],
                   'pupper': [None,None,"pupper",None,None,None,None],
                   'puppo':[None,None,None,None,None,None,"puppo"]})

I want to combine the last 4 columns and generate:

df = pd.DataFrame({'id':[1,2,3,4,5,6,7],
                    'vote':[5,4,5,1,10,1,9],
                    'categories': ["floofer","doggo","pupper","floofer","doggo",None,"puppo"]})

Any guidance is appreciated.

3 answers:

Answer 0: (score: 1)

A solution for the case where each row has at most one non-None value across the category columns:

cols = ['doggo','floofer','pupper','puppo']
cols1 = df.columns.difference(cols)
df2 = df[cols1].join(df[cols].ffill(axis=1).iloc[:, -1].rename('Categories'))
print (df2)
   id  vote Categories
0   1     5    floofer
1   2     4      doggo
2   3     5     pupper
3   4     1    floofer
4   5    10      doggo
5   6     1       None
6   7     9      puppo

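Explanation

First select only the columns holding the categorical data and forward fill the missing values along each row; the value you want then ends up in the last column: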

print (df[cols].ffill(axis=1))
  doggo  floofer   pupper    puppo
0   None  floofer  floofer  floofer
1  doggo    doggo    doggo    doggo
2   None     None   pupper   pupper
3   None  floofer  floofer  floofer
4  doggo    doggo    doggo    doggo
5   None     None     None     None
6   None     None     None    puppo

Then select the last column by position:

print (df[cols].ffill(axis=1).iloc[:, -1])
0    floofer
1      doggo
2     pupper
3    floofer
4      doggo
5       None
6      puppo
Name: puppo, dtype: object

If a row can hold multiple values, here is a solution that builds the result from the names of the categorical columns:

df = pd.DataFrame({'id':[1,2,3,4,5,6,7],
                'vote':[5,4,5,1,10,1,9],
                'doggo': [None,"doggo1",None,"doggo2","doggo3",None,None], 
                'floofer': ["floofer1",None,None,"floofer2",None,None,None],
                'pupper': [None,None,"pupper1",None,None,None,None],
               'puppo':["puppo1",None,None,None,None,None,"puppo2"]})
print (df)
   id  vote   doggo   floofer   pupper   puppo
0   1     5    None  floofer1     None  puppo1
1   2     4  doggo1      None     None    None
2   3     5    None      None  pupper1    None
3   4     1  doggo2  floofer2     None    None
4   5    10  doggo3      None     None    None
5   6     1    None      None     None    None
6   7     9    None      None     None  puppo2


import numpy as np

s = (df[cols].notnull()
            .dot(pd.Index(cols) + ', ')
            .str.strip(', ')
            .rename('Categories')
            .replace('', np.nan)
            )
df2 = df[cols1].join(s)
print (df2)
   id  vote      Categories
0   1     5  floofer, puppo
1   2     4           doggo
2   3     5          pupper
3   4     1  doggo, floofer
4   5    10           doggo
5   6     1             NaN
6   7     9           puppo
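
The dot trick above works because Python treats True as 1 and False as 0, so multiplying a string by a boolean either keeps it or yields an empty string. Here is a minimal plain-Python sketch of what happens for a single row. The names labels and row_to_categories are illustrative only and are not part of pandas:

```python
# Illustrative only: mimics the boolean-mask "dot product" for one row.
labels = ['doggo', 'floofer', 'pupper', 'puppo']

def row_to_categories(mask, labels=labels, sep=', '):
    # True * 'doggo, ' == 'doggo, ' and False * 'doggo, ' == ''
    joined = ''.join(flag * (label + sep) for flag, label in zip(mask, labels))
    # Drop the trailing separator characters, as .str.strip(', ') does
    return joined.strip(sep)

first_row = row_to_categories([False, True, False, True])
empty_row = row_to_categories([False, False, False, False])
print(first_row)
print(repr(empty_row))
```

A row with no categories comes out as an empty string, which is why the pandas version finishes with .replace('', np.nan).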

Another solution, for when the expected output should come from the cell values rather than the column names:

s = pd.Series(df[cols].add(', ').fillna('').values.sum(axis=1), 
                  index=df.index, name='Categories').str.strip(', ')
df2 = df[cols1].join(s)
print (df2)
   id  vote        Categories
0   1     5  floofer1, puppo1
1   2     4            doggo1
2   3     5           pupper1
3   4     1  doggo2, floofer2
4   5    10            doggo3
5   6     1                  
6   7     9            puppo2
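
If readability matters more than speed, the same result can probably be reached by joining the non-null cells of each row with apply. This is only a sketch: join_non_null is a hypothetical helper name, the frame is rebuilt so the block runs on its own, and the final replace turns empty strings into NaN to match the column-name solution.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                   'vote': [5, 4, 5, 1, 10, 1, 9],
                   'doggo': [None, "doggo1", None, "doggo2", "doggo3", None, None],
                   'floofer': ["floofer1", None, None, "floofer2", None, None, None],
                   'pupper': [None, None, "pupper1", None, None, None, None],
                   'puppo': ["puppo1", None, None, None, None, None, "puppo2"]})
cols = ['doggo', 'floofer', 'pupper', 'puppo']

def join_non_null(row, sep=', '):
    # Hypothetical helper: keep only the non-missing cells of the row and join them
    return sep.join(row.dropna().astype(str))

# Row-wise apply, then mark rows without any category as missing
categories = df[cols].apply(join_non_null, axis=1).replace('', np.nan)
result = df[['id', 'vote']].assign(Categories=categories)
print(result)
```

Row-wise apply is usually slower than the vectorised versions on large frames, so treat it as a fallback for small data.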

Answer 1: (score: 1)

bfill + iloc

You can bfill (back fill) along each row and select the first column:

(df.set_index(['id', 'vote'])
   .bfill(axis=1)
   .iloc[:, 0]
   .reset_index(name='Categories'))

   id  vote Categories
0   1     5    floofer
1   2     4      doggo
2   3     5     pupper
3   4     1    floofer
4   5    10      doggo
5   6     1       None
6   7     9      puppo
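
Back filling and forward filling only agree when a row holds a single value. The sketch below, on a hypothetical two-row frame named demo, shows that bfill with the first column keeps the leftmost value while ffill with the last column keeps the rightmost one:

```python
import pandas as pd

# Hypothetical two-row frame where the first row has two non-null cells
demo = pd.DataFrame({'doggo': ['doggo', None],
                     'floofer': [None, 'floofer'],
                     'puppo': ['puppo', None]})

first_non_null = demo.bfill(axis=1).iloc[:, 0]   # leftmost value per row
last_non_null = demo.ffill(axis=1).iloc[:, -1]   # rightmost value per row
print(first_non_null.tolist())
print(last_non_null.tolist())
```

Which of the two is appropriate depends on which column should win when a row has several categories.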

stack + reindex

cats = (df.drop(['id', 'vote'], axis=1).stack()
          .reset_index(level=1, drop=True).reindex(df.index))
pd.DataFrame(dict(id=df.id, vote=df.vote, Categories=cats))


   id  vote Categories
0   1     5    floofer
1   2     4      doggo
2   3     5     pupper
3   4     1    floofer
4   5    10      doggo
5   6     1        NaN
6   7     9      puppo

last_valid_index

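Slow but concise. Note that last_valid_index returns the column label rather than the cell value; the two happen to coincide in this example.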

(df.set_index(['id', 'vote'])
   .agg(lambda x: x.last_valid_index(), axis=1)
   .reset_index(name='Categories'))

   id  vote Categories
0   1     5    floofer
1   2     4      doggo
2   3     5     pupper
3   4     1    floofer
4   5    10      doggo
5   6     1       None
6   7     9      puppo

All of the above assume that 'id' and 'vote' are the only non-categorical columns.
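
One caveat with the last_valid_index approach is worth illustrating: it yields the label of the last non-null column, not the content of the cell. Here is a small sketch on a hypothetical frame named demo, whose cell values differ from the column names (apply is used in place of agg purely for illustration):

```python
import pandas as pd

# Hypothetical frame: cell values ('doggo1', 'puppo1') differ from column names
demo = pd.DataFrame({'id': [1, 2, 3],
                     'vote': [5, 4, 1],
                     'doggo': [None, 'doggo1', None],
                     'puppo': ['puppo1', None, None]})

labels = (demo.set_index(['id', 'vote'])
              .apply(lambda row: row.last_valid_index(), axis=1)
              .reset_index(name='Categories'))
print(labels)
```

The Categories column here holds 'puppo' and 'doggo', not 'puppo1' and 'doggo1', so this variant only fits when the column name itself is the desired category.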

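Answer 2: (score: 0)

We can use the fact that x or None evaluates to x (for a truthy x), and reduce each row with NumPy's logical_or to collapse the category columns into a single value.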

import numpy as np

cols = ['doggo','floofer','pupper','puppo']
categories = np.logical_or.reduce(df[cols], axis=1)
df = df.assign(categories=categories).drop(cols, axis=1)
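
The idea behind this reduction can be sketched in plain Python, without NumPy. The helper name first_truthy and the sample rows are illustrative assumptions, not part of the answer:

```python
from functools import reduce

def first_truthy(values):
    # "a or b" returns a when a is truthy, otherwise b,
    # so folding a row with "or" keeps the first truthy cell
    return reduce(lambda a, b: a or b, values, None)

rows = [[None, 'floofer', None, None],
        ['doggo', None, None, None],
        [None, None, None, None]]
picked = [first_truthy(row) for row in rows]
print(picked)
```

This relies on missing cells being None: a float NaN is truthy in Python, so a row starting with NaN would return NaN, and falsy values such as an empty string are skipped like None.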