I have a pandas DataFrame:
apple  banana  carrot  diet coke
    1       1       1          0
    0       1       0          0
    1       0       0          0
    1       0       1          1
    0       1       1          0
    0       1       1          0
I would like to convert it into the following:
[['apple', 'banana', 'carrot'],
['banana'],
['apple'],
['apple', 'carrot', 'diet coke'],
['banana', 'carrot'],
['banana', 'carrot']]
How can I do this? Many thanks.
Answer 0 (score: 6)
Because life is short, I would probably just do something straightforward like
>>> import pprint
>>> fruit = [df.columns[row.astype(bool)].tolist() for row in df.values]
>>> pprint.pprint(fruit)
[['apple', 'banana', 'carrot'],
['banana'],
['apple'],
['apple', 'carrot', 'diet coke'],
['banana', 'carrot'],
['banana', 'carrot']]
This works because we can use a boolean array (row.astype(bool)) to select only those elements of df.columns where the row is True.
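For readers who want to run this end to end, here is a minimal sketch that rebuilds the frame from the question and shows the mask for a single row before applying it to every row. The variable names `mask` and `fruit` are only illustrative, and `df.to_numpy()` is used as the current spelling of `df.values`.

```python
import pandas as pd

# Rebuild the frame from the question (column names copied from the question).
df = pd.DataFrame(
    [[1, 1, 1, 0],
     [0, 1, 0, 0],
     [1, 0, 0, 0],
     [1, 0, 1, 1],
     [0, 1, 1, 0],
     [0, 1, 1, 0]],
    columns=["apple", "banana", "carrot", "diet coke"],
)

# One row as a boolean mask: True wherever the cell is nonzero.
mask = df.to_numpy()[0].astype(bool)
print(mask.tolist())              # [True, True, True, False]
print(df.columns[mask].tolist())  # ['apple', 'banana', 'carrot']

# The same selection applied to every row.
fruit = [df.columns[row.astype(bool)].tolist() for row in df.to_numpy()]
print(fruit)
```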
Answer 1 (score: 2)
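@DSM's solution is great, but it treats every nonzero value as True, so it only gives the intended result if your values are 1 or 0. If you need to compare against some other value, you can try the following (originally written with df.ix, which was removed in pandas 1.0; df.iloc is the positional equivalent):
In [156]: [df.columns[df.iloc[i,:]==1].tolist() for i in range(len(df.index))]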
Out[156]:
[['apple', 'banana', 'carrot'],
['banana'],
['apple'],
['apple', 'carrot', 'dietcoke'],
['banana', 'carrot'],
['banana', 'carrot']]
Edit
You can, however, tweak @DSM's solution slightly:
In [177]: [df.columns[row == 1].tolist() for row in df.values]
Out[177]:
[['apple', 'banana', 'carrot'],
['banana'],
['apple'],
['apple', 'carrot', 'dietcoke'],
['banana', 'carrot'],
['banana', 'carrot']]
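To make the difference between the two variants concrete, here is a small sketch on made-up data. The `counts` frame is hypothetical and not from the question; it holds counts rather than 0/1 flags, so the `astype(bool)` mask and the `== 1` comparison no longer agree.

```python
import pandas as pd

# Hypothetical data (not from the question): counts instead of 0/1 flags.
counts = pd.DataFrame(
    [[2, 1, 0],
     [0, 3, 1]],
    columns=["apple", "banana", "carrot"],
)

# astype(bool) keeps every nonzero cell ...
nonzero = [counts.columns[row.astype(bool)].tolist() for row in counts.values]
# ... while == 1 keeps only the cells that are exactly 1.
exactly_one = [counts.columns[row == 1].tolist() for row in counts.values]

print(nonzero)      # [['apple', 'banana'], ['banana', 'carrot']]
print(exactly_one)  # [['banana'], ['carrot']]
```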
Some timings (the last one was recorded with df.ix, which has since been removed from pandas):
In [179]: %timeit [df.columns[row == 1].tolist() for row in df.values]
The slowest run took 4.03 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 212 us per loop
In [180]: %timeit [df.columns[row.astype(bool)].tolist() for row in df.values]
10000 loops, best of 3: 186 us per loop
In [181]: %timeit [df.columns[df.ix[i,:]==1].tolist() for i in range(len(df.index))]
100 loops, best of 3: 2.4 ms per loop
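Outside IPython, a comparison along these lines can be sketched with the standard timeit module. The function names below are made up for illustration, df.iloc stands in for the removed df.ix, and the numbers will depend on the machine and the pandas version, so no expected timings are shown.

```python
import timeit

import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1, 0],
     [0, 1, 0, 0],
     [1, 0, 0, 0],
     [1, 0, 1, 1],
     [0, 1, 1, 0],
     [0, 1, 1, 0]],
    columns=["apple", "banana", "carrot", "dietcoke"],
)

def by_eq():
    # Compare each NumPy row against 1.
    return [df.columns[row == 1].tolist() for row in df.values]

def by_astype():
    # Cast each NumPy row to a boolean mask.
    return [df.columns[row.astype(bool)].tolist() for row in df.values]

def by_iloc():
    # Row-by-row positional lookup; df.iloc replaces the removed df.ix here.
    return [df.columns[df.iloc[i, :] == 1].tolist() for i in range(len(df.index))]

# All three variants should agree on 0/1 data before we time them.
results = [by_eq(), by_astype(), by_iloc()]
assert results[0] == results[1] == results[2]

# Average wall-clock time per call over a fixed number of runs.
timings = {}
for fn in (by_eq, by_astype, by_iloc):
    timings[fn.__name__] = timeit.timeit(fn, number=200) / 200
    print(f"{fn.__name__}: {timings[fn.__name__] * 1e6:.0f} us per call")
```

The per-row df.iloc lookup builds a Series for every row, which is a plausible reason it was the slowest variant in the timings above.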
Answer 2 (score: 1)
In [24]: import pandas as pd
In [25]: import io
In [26]: data = """
apple banana carrot dietcoke
1 1 1 0
0 1 0 0
1 0 0 0
1 0 1 1
0 1 1 0
0 1 1 0
"""
In [27]: df = pd.read_csv(io.StringIO(data), delimiter=r'\s+')
In [28]: df
Out[28]:
apple banana carrot dietcoke
0 1 1 1 0
1 0 1 0 0
2 1 0 0 0
3 1 0 1 1
4 0 1 1 0
5 0 1 1 0
In [29]: [[df.columns[i] for i,field in enumerate(record) if field == 1] for j,*record in df.itertuples()]
Out[29]:
[['apple', 'banana', 'carrot'],
['banana'],
['apple'],
['apple', 'carrot', 'dietcoke'],
['banana', 'carrot'],
['banana', 'carrot']]
A solution that uses neither list comprehensions nor extended tuple unpacking looks like this:
In [32]: result = []
In [33]: for record in df.itertuples():
   ....:     row = []
   ....:     for i, field in enumerate(record[1:]):
   ....:         if field == 1:
   ....:             row.append(df.columns[i])
   ....:     result.append(row)
   ....:
In [34]: result
Out[34]:
[['apple', 'banana', 'carrot'],
['banana'],
['apple'],
['apple', 'carrot', 'dietcoke'],
['banana', 'carrot'],
['banana', 'carrot']]
Answer 3 (score: 1)
You can iterate and build the lists as Pedro mentioned, or just use stack() and groupby() to produce the lists:
df
Out[14]:
apple banana carrot diet_coke
0 1 1 1 0
1 0 1 0 0
2 1 0 0 0
3 1 0 1 1
4 0 1 1 0
5 0 1 1 0
df.stack()
Out[15]:
0 apple 1
banana 1
carrot 1
diet_coke 0
1 apple 0
banana 1
carrot 0
diet_coke 0
2 apple 1
banana 0
carrot 0
diet_coke 0
3 apple 1
banana 0
carrot 1
diet_coke 1
4 apple 0
banana 1
carrot 1
diet_coke 0
5 apple 0
banana 1
carrot 1
diet_coke 0
dtype: int64
df.stack()[df.stack().values ==1].reset_index()
Out[20]:
level_0 level_1 0
0 0 apple 1
1 0 banana 1
2 0 carrot 1
3 1 banana 1
4 2 apple 1
5 3 apple 1
6 3 carrot 1
7 3 diet_coke 1
8 4 banana 1
9 4 carrot 1
10 5 banana 1
11 5 carrot 1
newdf = df.stack()[df.stack().values == 1].reset_index()
newdf.groupby(['level_0'])['level_1'].apply(list)
Out[27]:
level_0
0 [apple, banana, carrot]
1 [banana]
2 [apple]
3 [apple, carrot, diet_coke]
4 [banana, carrot]
5 [banana, carrot]
Name: level_1, dtype: object
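Put together as a script, the stack() and groupby() steps above might look like the sketch below. The names `stacked`, `ones`, `lists` and `result` are illustrative only. The last step is an addition that is not part of the answer: a row containing no 1 at all would drop out of the groupby result, so the lookup falls back to an empty list for such rows.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1, 0],
     [0, 1, 0, 0],
     [1, 0, 0, 0],
     [1, 0, 1, 1],
     [0, 1, 1, 0],
     [0, 1, 1, 0]],
    columns=["apple", "banana", "carrot", "diet_coke"],
)

# Long format: one value per (row label, column name) pair.
stacked = df.stack()

# Keep only the cells equal to 1; reset_index yields columns level_0, level_1, 0.
ones = stacked[stacked == 1].reset_index()

# Collect the column names per original row.
lists = ones.groupby("level_0")["level_1"].apply(list)

# Back to a plain list of lists, with [] for any row that had no 1
# (an addition to the answer's approach; the sample data has no such row).
result = [lists.get(i, []) for i in df.index]
print(result)
```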