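I am trying to count every possible combination of the eight columns in a DataFrame, where a row counts toward a combination only if all of its values in those columns are 1. In other words, I need to know how many times each distinct overlap occurs.
I tried using itertools.product to get all the combinations, but it does not seem to work.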
import pandas as pd
import numpy as np
import itertools
df = pd.read_excel('filename.xlsx')
df.head(15)
a b c d e f g h
0 1 0 0 0 0 1 0 0
1 1 0 0 0 0 0 0 0
2 1 0 1 1 1 1 1 1
3 1 0 1 1 0 1 1 1
4 1 0 0 0 0 0 0 0
5 0 1 0 0 1 1 1 1
6 1 1 0 0 1 1 1 1
7 1 1 1 1 1 1 1 1
8 1 1 0 0 1 1 0 0
9 1 1 1 0 1 0 1 0
10 1 1 1 0 1 1 0 0
11 1 0 0 0 0 1 0 0
12 1 1 1 1 1 1 1 1
13 1 1 1 1 1 1 1 1
14 0 1 1 1 1 1 1 0
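print(list(itertools.product(df.columns)))
The expected output is a DataFrame containing the number of rows (n) for each valid combination (rows in which every value in the combination is 1).
For example: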
a b
0 1 0
1 1 0
2 1 0
3 1 0
4 1 0
5 0 1
6 1 1
7 1 1
8 1 1
9 1 1
10 1 1
11 1 0
12 1 1
13 1 1
14 0 1
would give
combination count
a 13
a_b 7
b 9
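Note that the output needs to include every possible combination of columns a through h, not just pairs.
Answer 0 (score: 5)
Using the powerset recipe from the itertools documentation:
s = pd.Series({
    '_'.join(c): df[c].min(axis=1).sum()
    for c in map(list, filter(None, powerset(df)))
})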
a 13
b 9
c 8
d 6
e 10
f 12
g 9
h 7
a_b 7
...
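For reference, here is a self-contained sketch of the same idea. It assumes the 15 sample rows from the question are typed in by hand instead of being read from the Excel file; `powerset` is the recipe from the itertools documentation, and the names `rows` and `counts` are only illustrative.

```python
from itertools import chain, combinations

import pandas as pd


def powerset(iterable):
    """powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"""
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))


# The 15 sample rows from the question, one string of 0/1 flags per row
# (assumption: copied by hand from the df.head(15) output).
rows = [
    "10000100", "10000000", "10111111", "10110111", "10000000",
    "01001111", "11001111", "11111111", "11001100", "11101010",
    "11101100", "10000100", "11111111", "11111111", "01111110",
]
df = pd.DataFrame([[int(ch) for ch in r] for r in rows], columns=list("abcdefgh"))

# A row counts for a combination only if every column in it is 1,
# i.e. the row-wise minimum over those columns is 1.
counts = pd.Series({
    "_".join(c): int(df[list(c)].min(axis=1).sum())
    for c in powerset(df.columns)
    if c  # skip the empty combination
})

print(len(counts))    # number of non-empty combinations of 8 columns
print(counts["a_b"])  # rows where both a and b are 1
```

With eight columns there are 2**8 - 1 = 255 non-empty combinations, so this loop stays cheap for small frames; the cost grows exponentially with the number of columns.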
The pairwise case is special and can be vectorized.
from itertools import combinations
u = df.T.dot(df)
pd.DataFrame({
    'combination': [*map('_'.join, combinations(df, 2))],
    # pandas < 0.24
    # 'count': u.values[np.triu_indices_from(u, k=1)]
    # pandas >= 0.24
    'count': u.to_numpy()[np.triu_indices_from(u, k=1)]
})
Taking the dot product and then extracting the upper-triangular values gives:
combination count
0 a_b 7
1 a_c 7
2 a_d 5
3 a_e 8
4 a_f 10
5 a_g 7
6 a_h 6
7 b_c 6
8 b_d 4
9 b_e 9
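To see why the dot product works, note that entry (i, j) of `df.T.dot(df)` is the sum of the element-wise product of columns i and j, which for 0/1 data is the number of rows where both are 1. The sketch below checks this on a hypothetical three-column frame (`toy`, `singles`, and `pairs` are illustrative names, not part of the answer).

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Hypothetical toy data: three indicator columns, four rows.
toy = pd.DataFrame({"a": [1, 1, 0, 1], "b": [1, 0, 1, 1], "c": [0, 0, 1, 1]})

# Entry (i, j) is sum(col_i * col_j): the number of rows where both are 1.
u = toy.T.dot(toy)

# The diagonal holds the single-column counts (a -> 3, b -> 3, c -> 2).
singles = pd.Series(np.diag(u.to_numpy()), index=u.index)

# The strict upper triangle holds each unordered pair exactly once, in the
# same order that combinations() generates the names
# (a_b -> 2, a_c -> 1, b_c -> 2).
row_idx, col_idx = np.triu_indices(len(u), k=1)
pairs = pd.Series(
    u.to_numpy()[row_idx, col_idx],
    index=["_".join(p) for p in combinations(toy.columns, 2)],
)

print(singles)
print(pairs)
```

The matrix only carries pairwise information, so combinations of three or more columns still need a different approach.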
Answer 1 (score: 0)
If the values are only 1 and 0, you can do the following:
df= pd.DataFrame({
'a': [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1],
'b': [1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0],
'c': [1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1],
'd': [1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1],
})
(df.a * df.b).sum()
The result is 17.
To get all the combinations, you can use combinations from itertools:
from itertools import combinations
analyze = [(col,) for col in df.columns]
analyze.extend(combinations(df.columns, 2))
for cols in analyze:
    num_ser = None
    for col in cols:
        if num_ser is None:
            num_ser = df[col]
        else:
            # multiply out of place: `num_ser *= df[col]` can overwrite the column in df
            num_ser = num_ser * df[col]
    num = num_ser.sum()
    print(f'{cols} contains {num}')
The result is:
('a',) contains 26
('b',) contains 30
('c',) contains 23
('d',) contains 23
('a', 'b') contains 17
('a', 'c') contains 11
('a', 'd') contains 12
('b', 'c') contains 13
('b', 'd') contains 14
('c', 'd') contains 11
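The loop above only covers single columns and pairs. A minimal sketch of how the same product idea might be extended to every combination size, assuming 0/1 data and using a hypothetical three-column frame (`toy` and `counts` are illustrative names):

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy data with 0/1 values only.
toy = pd.DataFrame({"a": [1, 1, 0, 1], "b": [1, 0, 1, 1], "c": [0, 0, 1, 1]})

counts = {}
for r in range(1, len(toy.columns) + 1):  # combination sizes 1..n
    for cols in combinations(toy.columns, r):
        # The row-wise product is 1 only where every selected column is 1.
        counts["_".join(cols)] = int(toy[list(cols)].prod(axis=1).sum())

for name, n in counts.items():
    print(f"{name} contains {n}")
```

Selecting the columns and calling `prod(axis=1)` builds a new Series each time, so the original frame is never modified.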
Answer 2 (score: 0)
A co-occurrence matrix is what you need.
Let's first construct an example:
import numpy as np
import pandas as pd
mat = np.zeros((5,5))
mat[0,0] = 1
mat[0,1] = 1
mat[1,0] = 1
mat[2,1] = 1
mat[3,3] = 1
mat[3,4] = 1
mat[2,4] = 1
cols = ['a','b','c','d','e']
df = pd.DataFrame(mat,columns=cols)
print(df)
a b c d e
0 1.0 1.0 0.0 0.0 0.0
1 1.0 0.0 0.0 0.0 0.0
2 0.0 1.0 0.0 0.0 1.0
3 0.0 0.0 0.0 1.0 1.0
4 0.0 0.0 0.0 0.0 0.0
Now we construct the co-occurrence matrix:
# construct the co-occurrence matrix:
co_df = df.T.dot(df)
print(co_df)
a b c d e
a 2.0 1.0 0.0 0.0 0.0
b 1.0 2.0 0.0 0.0 1.0
c 0.0 0.0 0.0 0.0 0.0
d 0.0 0.0 0.0 1.0 1.0
e 0.0 1.0 0.0 1.0 2.0
Finally, the desired result:
result = {}
for c1 in cols:
    for c2 in cols:
        if c1 == c2:
            if c1 not in result:
                result[c1] = co_df[c1][c2]
        else:
            if '_'.join([c1,c2]) not in result:
                result['_'.join([c1,c2])] = co_df[c1][c2]
print(result)
{'a': 2.0, 'a_b': 1.0, 'a_c': 0.0, 'a_d': 0.0, 'a_e': 0.0, 'b_a': 1.0, 'b': 2.0, 'b_c': 0.0, 'b_d': 0.0, 'b_e': 1.0, 'c_a': 0.0, 'c_b': 0.0, 'c': 0.0, 'c_d': 0.0, 'c_e': 0.0, 'd_a': 0.0, 'd_b': 0.0, 'd_c': 0.0, 'd': 1.0, 'd_e': 1.0, 'e_a': 0.0, 'e_b': 1.0, 'e_c': 0.0, 'e_d': 1.0, 'e': 2.0}
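The dictionary above lists every pair twice (for example both a_b and b_a), because the matrix is symmetric. If each unordered pair should appear only once, one possible variant is to read the diagonal for single columns and iterate over `itertools.combinations` for the pairs. This is a sketch on the same 5x5 example, with an integer dtype assumed so the counts print without a trailing `.0`:

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Same 5x5 example as above, built from the (row, column) positions set to 1.
cols = ["a", "b", "c", "d", "e"]
mat = np.zeros((5, 5), dtype=int)
for r, c in [(0, 0), (0, 1), (1, 0), (2, 1), (3, 3), (3, 4), (2, 4)]:
    mat[r, c] = 1
df = pd.DataFrame(mat, columns=cols)

co_df = df.T.dot(df)

# Single-column counts come from the diagonal ...
result = {c: int(co_df.loc[c, c]) for c in cols}
# ... and each unordered pair is read once from above the diagonal.
result.update(
    {f"{c1}_{c2}": int(co_df.loc[c1, c2]) for c1, c2 in combinations(cols, 2)}
)
print(result)
```

This yields 5 single counts plus 10 pair counts. As with any co-occurrence matrix, combinations of three or more columns cannot be read from it.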
Answer 3 (score: 0)
Since you happen to have exactly 8 columns, np.packbits and np.bincount come in handy here:
import numpy as np
import pandas as pd
# make large example
ncol, nrow = 8, 1_000_000
df = pd.DataFrame(np.random.randint(0,2,(nrow,ncol)), columns=list("abcdefgh"))
from time import time
T = [time()]
# encode as binary numbers and count
counts = np.bincount(np.packbits(df.values.astype(np.uint8)),None,256)
# find sets in other sets
rng = np.arange(256, dtype=np.uint8)
contained = (rng & rng[:, None]) == rng[:, None]
# and sum
ccounts = (counts * contained).sum(1)
# if there are empty bins, remove them
nz = np.where(ccounts)[0].astype(np.uint8)
# helper to build bin labels
a2h = np.array(list("abcdefgh"))
# put labels to counts
result = pd.Series(ccounts[nz], index = ["_".join((*a2h[np.unpackbits(i).view(bool)],)) for i in nz])
from itertools import chain, combinations
def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))
T.append(time())
s = pd.Series({
'_'.join(c): df[c].min(axis=1).sum()
for c in map(list, filter(None, powerset(df)))
})
T.append(time())
print("packbits {:.3f} powerset {:.3f}".format(*np.diff(T)))
print("results equal", (result.sort_index()[1:]==s.sort_index()).all())
This produces the same result as the powerset approach, but in this run it is roughly 1000 times faster:
packbits 0.016 powerset 21.974
results equal True
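The encoding step can be easier to follow on a tiny input. The sketch below uses three hypothetical rows (not the benchmark data) to show how each row becomes one byte and how the subset mask turns exact-pattern counts into combination counts.

```python
import numpy as np

# Hypothetical toy rows over the eight columns a..h (a is the most significant bit).
rows = np.array([
    [1, 1, 0, 0, 0, 0, 0, 0],   # a and b set
    [1, 0, 0, 0, 0, 0, 0, 0],   # only a set
    [1, 1, 0, 0, 0, 0, 0, 1],   # a, b and h set
], dtype=np.uint8)

# packbits flattens the array and packs every 8 bits into one byte,
# so with exactly 8 columns each row becomes one code in 0..255.
codes = np.packbits(rows)
print(codes)  # [192 128 193]

# How many rows carry each exact pattern.
counts = np.bincount(codes, minlength=256)

# contained[i, j] is True when every bit of pattern i is also set in pattern j.
rng = np.arange(256, dtype=np.uint8)
contained = (rng & rng[:, None]) == rng[:, None]

# Rows matching a combination = rows whose pattern is a superset of it.
ccounts = (counts * contained).sum(axis=1)
print(ccounts[0b10000000])  # combination "a": all three rows
print(ccounts[0b11000000])  # combination "a_b": first and third row
```

The trick relies on one row fitting in exactly one byte; with a different number of columns the rows would need padding or a wider integer code.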