My goal is to get the combinations of column values. For example,
   UT    Fruit_1 Fruit_2 Fruit_3
0  I1      Apple  Orange   Peach
1  I2      Apple   Lemon     NaN
2  I3  Starfruit   Apple  Orange
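In this data frame, I want to combine the values of the Fruit_* columns within each row. The result should be (Apple, Orange), (Apple, Peach), (Orange, Peach)...
As you can see, the data frame contains NaN. So, after building the combinations, I want to drop the rows that contain the text "nan". After reading some posts related to this task, I wrote the following code.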
import pandas as pd
import numpy as np
from itertools import combinations

df = pd.DataFrame([['I1', 'Apple', 'Orange', 'Peach'],
                   ['I2', 'Apple', 'Lemon', np.nan],
                   ['I3', 'Starfruit', 'Apple', 'Orange']],
                  columns=['UT', 'Fruit_1', 'Fruit_2', 'Fruit_3'])

# build all 2-element combinations per row
temp1 = df.set_index('UT')
temp2 = temp1.apply(lambda x: list(combinations(x, 2)), 1)
# spread the pairs into columns, then reshape to one pair per row
temp3 = temp2.apply(lambda x: pd.Series(x))
temp4 = temp3.stack().reset_index(level=[0, 1])
del temp4['level_1']
temp4.columns = ['UT', 'pair']
# this is the line that fails
temp4[~temp4.pair.str.contains('nan')]
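However, after running this code I get an error message:
TypeError: ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
How can I fix this error?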
Answer 0 (score: 0)
For pandas 0.25+ you can use Series.explode, and filter the NaNs out before calling combinations by using a list comprehension with a filter. The y == y test is a trick that works because np.nan != np.nan by definition:
import pandas as pd
import numpy as np
from itertools import combinations

df = pd.DataFrame([['I1', 'Apple', 'Orange', 'Peach'],
                   ['I2', 'Apple', 'Lemon', np.nan],
                   ['I3', 'Starfruit', 'Apple', 'Orange']],
                  columns=['UT', 'Fruit_1', 'Fruit_2', 'Fruit_3'])

temp4 = (df.set_index('UT')
           # y == y is False only for NaN, so missing values are skipped
           .apply(lambda x: list(combinations([y for y in x if y == y], 2)), 1)
           .explode()
           .reset_index(name='pair'))
print(temp4)
UT pair
0 I1 (Apple, Orange)
1 I1 (Apple, Peach)
2 I1 (Orange, Peach)
3 I2 (Apple, Lemon)
4 I3 (Starfruit, Apple)
5 I3 (Starfruit, Orange)
6 I3 (Apple, Orange)
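As a side note on the original error: the pair column holds tuples, not strings, so as far as I can tell .str.contains has nothing to match against and does not give back a boolean mask, which is why ~ fails. If you prefer to filter after building the pairs, one option is to test the tuple elements directly. This is only a minimal sketch; the names pairs, mask and kept are made up for illustration and the sample tuples are not taken from your data frame.

```python
import numpy as np
import pandas as pd

# NaN is the only value that is not equal to itself
print(np.nan == np.nan)

# Illustrative data: tuples like the ones the question's code produces
pairs = pd.Series([('Apple', 'Lemon'), ('Apple', np.nan), ('Lemon', np.nan)])

# Test the tuple elements directly instead of using .str.contains
mask = pairs.apply(lambda t: all(v == v for v in t))
kept = pairs[mask]
print(kept.tolist())
```

Because mask is a plain boolean Series, it can also be inverted with ~ if you want the rejected pairs instead.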
For older pandas versions:
temp4 = (df.set_index('UT')
           # in these older versions stack() drops NaN values by default
           .stack()
           .groupby(level=0)
           .apply(lambda x: pd.Series(list(combinations(x, 2))))
           .reset_index(level=1, drop=True)
           .reset_index(name='pair'))
print(temp4)
UT pair
0 I1 (Apple, Orange)
1 I1 (Apple, Peach)
2 I1 (Orange, Peach)
3 I2 (Apple, Lemon)
4 I3 (Starfruit, Apple)
5 I3 (Starfruit, Orange)
6 I3 (Apple, Orange)
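If you would rather not depend on explode or on how stack treats missing values, the same idea can be written as a plain loop. This is a rough sketch rather than a tuned solution: fruit_pairs and its id_col parameter are hypothetical names I introduced here, and iterrows will be slow on large frames.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def fruit_pairs(frame, id_col='UT'):
    """Hypothetical helper: one row per (id, pair), missing values skipped."""
    rows = []
    for key, values in frame.set_index(id_col).iterrows():
        # keep only the non-missing values of this row
        present = [v for v in values if pd.notna(v)]
        rows.extend((key, pair) for pair in combinations(present, 2))
    return pd.DataFrame(rows, columns=[id_col, 'pair'])


df = pd.DataFrame([['I1', 'Apple', 'Orange', 'Peach'],
                   ['I2', 'Apple', 'Lemon', np.nan],
                   ['I3', 'Starfruit', 'Apple', 'Orange']],
                  columns=['UT', 'Fruit_1', 'Fruit_2', 'Fruit_3'])

result = fruit_pairs(df)
print(result)
```

The NaN values are dropped before combinations is called, so no pair containing a missing value is ever created and no filtering step is needed afterwards.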