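I have a huge co-occurrence matrix whose index is skill_id and whose column names are also skill_id; each cell holds the number of times the two skills occur together. Please find an example below.
I want the data in a three-column DataFrame: skillid1, skillid2, count. Any help would be highly appreciated.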
Answer 0: (score: 0)
from itertools import combinations

weights = []

# Diagonal entries: each skill paired with itself
for skill_id in skills.skill_id:
    if str(skill_id) in count_model.vocabulary_:
        i = count_model.vocabulary_[str(skill_id)]
        if skills_occurrences[i][i] > 0:
            weights.append([skill_id, skill_id, skills_occurrences[i][i]])

# Off-diagonal entries: every unordered pair of distinct skills
for skill_a, skill_b in combinations(skills.skill_id, 2):
    if str(skill_a) in count_model.vocabulary_ and str(skill_b) in count_model.vocabulary_:
        i = count_model.vocabulary_[str(skill_a)]
        j = count_model.vocabulary_[str(skill_b)]
        if skills_occurrences[i][j] > 0:
            weights.append([str(skill_a), str(skill_b), skills_occurrences[i][j]])
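There is one more dataset to process; after that, it is just a nested loop over pairs of skills that looks each pair up by its index and keeps appending the two IDs and their count.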
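The same lookup-and-append idea can be sketched in a self-contained form. This is only an illustration under assumptions: a plain nested list stands in for skills_occurrences, a dict stands in for count_model.vocabulary_, and the helper name cooccurrence_to_rows and the sample data are hypothetical. It uses combinations_with_replacement so that the diagonal and the off-diagonal pairs are handled in one loop.

```python
from itertools import combinations_with_replacement


def cooccurrence_to_rows(skill_ids, vocabulary, matrix):
    """Return [skill_a, skill_b, count] rows for every pair with a non-zero count.

    Hypothetical helper: `vocabulary` maps str(skill_id) to a row/column
    position in `matrix`, mirroring how the answer uses count_model.vocabulary_.
    """
    rows = []
    # combinations_with_replacement also yields (a, a), which covers the diagonal
    for a, b in combinations_with_replacement(skill_ids, 2):
        i = vocabulary.get(str(a))
        j = vocabulary.get(str(b))
        if i is None or j is None:
            continue  # skip skills that are missing from the vocabulary
        count = matrix[i][j]
        if count > 0:
            rows.append([a, b, count])
    return rows


# Hypothetical sample data, not taken from the question
skill_ids = [4044, 6168, 6229]
vocabulary = {"4044": 0, "6168": 1, "6229": 2}
matrix = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]

rows = cooccurrence_to_rows(skill_ids, vocabulary, matrix)
```

The resulting list can be passed to pd.DataFrame(rows, columns=["skillid1", "skillid2", "count"]) to get the three-column layout asked for in the question.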
Answer 1: (score: 0)
Assuming your co-occurrence matrix is called df and looks like this:
      4044  4092  4651  6168  6229  6284  6295
4044     0     0     0     1     1     0     0
4092     0     0     1     0     0     0     0
4651     0     1     0     0     0     0     0
6168     1     0     0     0     1     0     0
6229     1     0     0     1     0     0     0
6284     0     0     0     0     0     0     1
6295     0     0     0     0     0     1     0
I suggest the following:
import itertools

import pandas as pd

# get all possible pairs of (skillid1, skillid2)
edges = list(itertools.combinations(df.columns, 2))
# find associated weights in the original df
edges_with_weights = [(node1, node2, df.loc[node1, node2]) for (node1, node2) in edges]
# put it all in a new dataframe
new_df = pd.DataFrame(edges_with_weights, columns=["skillid1", "skillid2", "count"])
Now you have the new_df you wanted:
   skillid1  skillid2  count
0      4044      4092      0
1      4044      4651      0
2      4044      6168      1
3      4044      6229      1
4      4044      6284      0
5      4044      6295      0
6      4092      4651      1
7      4092      6168      0
...
...
...
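For a truly huge matrix, the Python-level list comprehension may be slow. As an alternative (not part of the answer above), here is a minimal vectorized sketch. It assumes the matrix is symmetric and that the index and the columns list the skill IDs in the same order; the names long_df and nonzero are illustrative. np.triu_indices with k=1 returns the positions strictly above the diagonal in row-major order, which matches the pair order produced by itertools.combinations.

```python
import numpy as np
import pandas as pd

# Rebuild the example matrix shown above
ids = [4044, 4092, 4651, 6168, 6229, 6284, 6295]
data = [
    [0, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 0],
]
df = pd.DataFrame(data, index=ids, columns=ids)

# Row and column positions of every cell strictly above the diagonal
i, j = np.triu_indices(len(df), k=1)

# One row per unordered pair of distinct skills
long_df = pd.DataFrame({
    "skillid1": df.index[i],
    "skillid2": df.columns[j],
    "count": df.to_numpy()[i, j],
})

# Optionally keep only the pairs that actually co-occur
nonzero = long_df[long_df["count"] > 0].reset_index(drop=True)
```

Here long_df has one row per pair, including those with a count of 0, while nonzero keeps only the co-occurring pairs, which is usually much smaller for a sparse matrix.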