Question

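I am looking for a way to use pandas and Python to combine several columns with known column names from an Excel sheet into a single new column, keeping all the important information, as in the following example: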

Input:

ID,tp_c,tp_b,tp_p  
0,transportation - cars,transportation - boats,transportation - planes
1,checked,-,-
2,-,checked,-
3,checked,checked,-
4,-,checked,checked
5,checked,checked,checked

Desired output:

ID,tp_all  
0,transportation  
1,cars  
2,boats  
3,cars+boats  
4,boats+planes  
5,cars+boats+planes

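The row with ID 0 contains a description of each column's content. Ideally, the code would parse the description in the second row, look at the part after the " - ", and concatenate those values in the new "tp_all" column.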

Answer 1

This is quite interesting, as it is a reverse get_dummies...

I think I would munge the column names by hand so that you have a boolean DataFrame:

In [11]: df1  # df == 'checked'
Out[11]:
    cars  boats planes
0
1   True  False  False
2  False   True  False
3   True   True  False
4  False   True   True
5   True   True   True
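
The boolean frame above could be derived from the raw table along these lines. This is only a sketch: the `df` built here is a hypothetical reconstruction of the question's input, and the names `raw` and `labels` are illustrative. Unlike the frame shown above, it drops the description row instead of leaving it blank.

```python
import pandas as pd

# Hypothetical reconstruction of the question's input table.
df = pd.DataFrame({
    "ID": [0, 1, 2, 3, 4, 5],
    "tp_c": ["transportation - cars", "checked", "-", "checked", "-", "checked"],
    "tp_b": ["transportation - boats", "-", "checked", "checked", "checked", "checked"],
    "tp_p": ["transportation - planes", "-", "-", "-", "checked", "checked"],
})

raw = df.set_index("ID")
# Row 0 holds the descriptions; keep only the part after " - ".
labels = raw.loc[0].str.split(" - ").str[1]
# Compare the remaining rows against the marker string to get booleans.
df1 = raw.loc[1:].eq("checked")
df1.columns = labels.tolist()
```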

Now you can use an apply with zip:

In [12]: df1.apply(lambda row: '+'.join([col for col, b in zip(df1.columns, row) if b]),
                   axis=1)
Out[12]:
0
1                 cars
2                boats
3           cars+boats
4         boats+planes
5    cars+boats+planes
dtype: object

Now you just have to tweak the headers to get the desired csv.

It would be nice if there were a less manual / faster way to do a reverse get_dummies...
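
One possible alternative to the row-wise apply is to index the column labels with each boolean row directly. This is a sketch under the assumption that `df1` contains only boolean columns; whether it is actually faster would need measuring on real data. The `df1` below is rebuilt by hand so the snippet runs on its own.

```python
import pandas as pd

# Boolean frame as in the example above (rows 1 to 5 only).
df1 = pd.DataFrame(
    {
        "cars": [True, False, True, False, True],
        "boats": [False, True, True, True, True],
        "planes": [False, False, False, True, True],
    },
    index=[1, 2, 3, 4, 5],
)

# One boolean row per record; each row selects the matching column labels.
mask = df1.to_numpy(dtype=bool)
tp_all = pd.Series(
    ["+".join(df1.columns[row]) for row in mask],
    index=df1.index,
    name="tp_all",
)
```

A row with nothing checked ends up as an empty string.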

Answer 2

OK, a more dynamic method:

In [63]:
# get a list of the columns
col_list = list(df.columns)
# remove 'ID' column
col_list.remove('ID')
# create a dict as a lookup
col_dict = dict(zip(col_list, [df.iloc[0][col].split(' - ')[1] for col in col_list]))
col_dict
Out[63]:
{'tp_b': 'boats', 'tp_c': 'cars', 'tp_p': 'planes'}
In [64]:
# define a func that tests the value and uses the dict to create our string
def func(x):
    temp = ''
    for col in col_list:
        if x[col] == 'checked':
            if len(temp) == 0:
                temp = col_dict[col]
            else:
                temp = temp + '+' + col_dict[col]
    return temp
df['combined'] = df[1:].apply(lambda row: func(row), axis=1)
df
Out[64]:
   ID                   tp_c                    tp_b                     tp_p  \
0   0  transportation - cars  transportation - boats  transportation - planes   
1   1                checked                     NaN                      NaN   
2   2                    NaN                 checked                      NaN   
3   3                checked                 checked                      NaN   
4   4                    NaN                 checked                  checked   
5   5                checked                 checked                  checked   

            combined  
0                NaN  
1               cars  
2              boats  
3         cars+boats  
4       boats+planes  
5  cars+boats+planes  

[6 rows x 5 columns]
In [65]:

# .ix was removed in pandas 1.0; .loc does the same label-based selection here
df = df.loc[1:, ['ID', 'combined']]
df
Out[65]:
   ID           combined
1   1               cars
2   2              boats
3   3         cars+boats
4   4       boats+planes
5   5  cars+boats+planes

[5 rows x 2 columns]
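
For reference, the whole approach can be run end to end as in the sketch below. It assumes the data arrives as CSV text and that the lone "-" cells are read as NaN through `na_values="-"`, which would explain the NaN values in the output above. The names `csv_text`, `parts` and `combine` are illustrative, and filling row 0 with the shared category name is an addition made to match the desired output in the question.

```python
import io

import pandas as pd

csv_text = """ID,tp_c,tp_b,tp_p
0,transportation - cars,transportation - boats,transportation - planes
1,checked,-,-
2,-,checked,-
3,checked,checked,-
4,-,checked,checked
5,checked,checked,checked
"""

# Only cells that are exactly "-" become NaN; the descriptions are untouched.
df = pd.read_csv(io.StringIO(csv_text), na_values="-")

value_cols = [c for c in df.columns if c != "ID"]
# Split each row-0 description into [category, item].
parts = {c: df.loc[0, c].split(" - ") for c in value_cols}


def combine(row):
    """Join the item names of every column marked 'checked' in this row."""
    return "+".join(parts[c][1] for c in value_cols if row[c] == "checked")


df["tp_all"] = df.loc[1:].apply(combine, axis=1)
# Row 0 gets the shared category name, as in the desired output.
df.loc[0, "tp_all"] = parts[value_cols[0]][0]

result = df[["ID", "tp_all"]]
```

Calling `result.to_csv(index=False)` would then produce the two-column CSV.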

Answer 3

Here is one way:

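import pandas

# assumes headers like 'transportation - cars', with the ID in column 0
newCol = pandas.Series('', index=d.index)
# .ix was removed in pandas 1.0; .iloc selects the same columns by position
for col in d.iloc[:, 1:]:
    name = '+' + col.split('-')[1].strip()
    newCol[d[col] == 'checked'] += name
newCol = newCol.str.strip('+')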

Then:

>>> newCol
0                 cars
1                boats
2           cars+boats
3         boats+planes
4    cars+boats+planes
dtype: object

You can create a new DataFrame with this column, or do whatever you like with it.

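Edit: I see you have edited your question so that the names of the modes of transportation are now in row 0 instead of in the column headers. It is easier if they are in the column headers (as my answer assumes), and your new column headers don't seem to contain any additional useful information, so you should probably start by setting the column names to the info from row 0 and deleting row 0.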
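
A minimal sketch of that preparation step, assuming `d` is laid out as in the edited question (the frame below is a hypothetical reconstruction):

```python
import pandas as pd

# Hypothetical frame with the descriptions in row 0, as in the edited question.
d = pd.DataFrame({
    "ID": [0, 1, 2, 3, 4, 5],
    "tp_c": ["transportation - cars", "checked", "-", "checked", "-", "checked"],
    "tp_b": ["transportation - boats", "-", "checked", "checked", "checked", "checked"],
    "tp_p": ["transportation - planes", "-", "-", "-", "checked", "checked"],
})

# Use the row-0 descriptions as headers for every column except ID.
d.columns = ["ID"] + d.iloc[0, 1:].tolist()
# Drop the description row and renumber the remaining rows from 0.
d = d.iloc[1:].reset_index(drop=True)
```

After this, the loop above can run on `d` unchanged, since each header now has the form "transportation - cars".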

Edit, then concatenate the values of several columns into one column (pandas, python)
