I have a DataFrame like this:
id a b c d e
0 a10 a11 a12 a13 a14
1 a10 a21 a12 a23 a24
2 a30 a21 a12 a33 a14
3 a30 a21 a12 a43 a44
4 a10 a51 a12 a53 a14
I want to get a unique list of all combinations of length 'x' from the DataFrame. If the length is 3, some of the combinations are:
[[a10,a11,a12],[a10,a21,a12],[a10,a51,a12],[a30,a11,a12],[a30,a21,a12],[a30,a51,a12],
[a11,a12,a13],[a11,a12,a23],[a11,a12,a33],[a11,a12,a43],[a11,a12,a53],[a21,a12,a13]....]
There are only 2 constraints:
1. The length of each combination list must equal 'x'.
2. A combination may contain at most one value from any given column of the DataFrame.
A minimal snippet that builds the DataFrame is given below. Any help is appreciated. Thanks!
import pandas as pd

data_dict = {'a': ['a10', 'a10', 'a30', 'a30', 'a10'],
             'b': ['a11', 'a21', 'a21', 'a21', 'a51'],
             'c': ['a12', 'a12', 'a12', 'a12', 'a12'],
             'd': ['a13', 'a23', 'a33', 'a43', 'a53'],
             'e': ['a14', 'a24', 'a14', 'a44', 'a14']}
df1 = pd.DataFrame(data_dict)
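Answer 0 (score: 3)
To build the combinations from each column's unique values, take every set of 3 columns and form the product of their unique values: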
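from itertools import combinations, product
import numpy as np

# One list of tuples per 3-column subset; each tuple holds one value per column.
aa = [list(product(np.unique(df1[col1]),
                   np.unique(df1[col2]),
                   np.unique(df1[col3])))
      for col1, col2, col3 in combinations(df1.columns, 3)]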
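The snippet above hard-codes a length of 3 and returns one nested list per column triple. A minimal sketch of how it might be generalized to any length x and flattened into a single list is shown below. The helper name `column_combinations` is hypothetical, and the sketch assumes that no value occurs in more than one column (true for the sample data); otherwise the result could contain duplicates.

```python
from itertools import combinations, product

import pandas as pd


def column_combinations(df, x):
    """Return all length-x lists that take one value from each of x distinct columns.

    Hypothetical helper, not part of pandas or itertools.
    """
    # Unique values of each column, in order of first appearance.
    uniques = {col: df[col].unique().tolist() for col in df.columns}
    result = []
    for cols in combinations(df.columns, x):
        # The Cartesian product picks exactly one value from each chosen column.
        result.extend(list(t) for t in product(*(uniques[c] for c in cols)))
    return result


data_dict = {'a': ['a10', 'a10', 'a30', 'a30', 'a10'],
             'b': ['a11', 'a21', 'a21', 'a21', 'a51'],
             'c': ['a12', 'a12', 'a12', 'a12', 'a12'],
             'd': ['a13', 'a23', 'a33', 'a43', 'a53'],
             'e': ['a14', 'a24', 'a14', 'a44', 'a14']}
df1 = pd.DataFrame(data_dict)

combos = column_combinations(df1, 3)
print(len(combos))  # 184
print(combos[0])    # ['a10', 'a11', 'a12']
```

Because every combination draws from x different columns, both constraints from the question hold by construction, and no filtering step is needed.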
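Old answer
First we flatten the matrix into a one-dimensional array with ndarray.flatten, get the unique values with np.unique, and then use itertools.combinations. Note that this version does not enforce the one-value-per-column constraint: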
from itertools import combinations
import numpy as np

a = np.unique(df1.to_numpy().flatten())
aa = set(combinations(a, 3))
{('a10', 'a11', 'a12'),
('a10', 'a11', 'a13'),
('a10', 'a11', 'a14'),
('a10', 'a11', 'a21'),
('a10', 'a11', 'a23'),
('a10', 'a11', 'a24'),
('a10', 'a11', 'a30'),
('a10', 'a11', 'a33'),
('a10', 'a11', 'a43'),
('a10', 'a11', 'a44'),
('a10', 'a11', 'a51'),
('a10', 'a11', 'a53'),
('a10', 'a12', 'a13'),
('a10', 'a12', 'a14'),
...
Or, to actually get lists (less efficient):
from itertools import combinations
import numpy as np

a = np.unique(df1.to_numpy().flatten())
aa = [list(x) for x in set(combinations(a, 3))]
[['a12', 'a33', 'a51'],
['a11', 'a12', 'a13'],
['a10', 'a11', 'a21'],
['a10', 'a23', 'a24'],
['a12', 'a14', 'a24'],
['a14', 'a43', 'a53'],
['a11', 'a21', 'a53'],
['a10', 'a12', 'a24'],
['a12', 'a21', 'a44'],
['a12', 'a30', 'a51'],
['a14', 'a23', 'a30'],
...
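As a quick sanity check on the old approach, the sketch below counts its output on the sample data and confirms that it contains combinations with two values from the same column, for example 'a10' and 'a30' from column a. It rebuilds the question's DataFrame so it can be run on its own.

```python
from itertools import combinations

import numpy as np
import pandas as pd

data_dict = {'a': ['a10', 'a10', 'a30', 'a30', 'a10'],
             'b': ['a11', 'a21', 'a21', 'a21', 'a51'],
             'c': ['a12', 'a12', 'a12', 'a12', 'a12'],
             'd': ['a13', 'a23', 'a33', 'a43', 'a53'],
             'e': ['a14', 'a24', 'a14', 'a44', 'a14']}
df1 = pd.DataFrame(data_dict)

# All distinct values in the frame, regardless of which column they come from.
a = np.unique(df1.to_numpy().flatten())
aa = set(combinations(a, 3))

print(len(a), len(aa))              # 14 364
# 'a10' and 'a30' both come from column 'a', so this tuple breaks constraint 2.
print(('a10', 'a11', 'a30') in aa)  # True
```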
Answer 1 (score: 2)
Use combinations, then filter for the second condition with sets created from each column of the DataFrame:
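from itertools import combinations
import numpy as np

# One set of values per column of df1 (the DataFrame from the question).
L = [set(df1[x]) for x in df1]
# Keep a combination only if it shares fewer than 2 values with every column.
a = [x for x in combinations(np.unique(df1.values.ravel()), 3)
     if all(len(set(x).intersection(y)) < 2 for y in L)]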
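The same filtering idea can be wrapped in a function that takes the length as a parameter. The following is only a sketch: the name `filtered_combinations` is hypothetical, and it uses plain Python sets instead of np.unique to collect the distinct values, which is a substitution rather than the answer's exact code.

```python
from itertools import combinations

import pandas as pd


def filtered_combinations(df, x):
    """Return length-x tuples with at most one value from any column.

    Hypothetical helper illustrating the generate-then-filter approach.
    """
    # One set of values per column.
    column_sets = [set(df[col]) for col in df.columns]
    # All distinct values in the frame, sorted for a deterministic order.
    values = sorted(set().union(*column_sets))
    return [
        combo
        for combo in combinations(values, x)
        # Reject a combination that shares 2 or more values with one column.
        if all(len(col_set.intersection(combo)) < 2 for col_set in column_sets)
    ]


data_dict = {'a': ['a10', 'a10', 'a30', 'a30', 'a10'],
             'b': ['a11', 'a21', 'a21', 'a21', 'a51'],
             'c': ['a12', 'a12', 'a12', 'a12', 'a12'],
             'd': ['a13', 'a23', 'a33', 'a43', 'a53'],
             'e': ['a14', 'a24', 'a14', 'a44', 'a14']}
df1 = pd.DataFrame(data_dict)

result = filtered_combinations(df1, 3)
print(len(result))                      # 184
print(('a10', 'a11', 'a30') in result)  # False
```

This version generates all 364 candidate triples from the 14 distinct values before discarding the invalid ones, so on wide frames or for large x it is likely to do more work than building the combinations column by column.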