Converting a co-occurrence matrix to an edge list

Time: 2018-02-15 18:41:45

Tags: python list

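I have a huge co-occurrence matrix whose index and column names are both skill_id; each cell holds the co-occurrence count for that pair of skills. Please find a sample below (image: cooccurrence matrix data).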

I want the data in a three-column DataFrame: skillid1, skillid2, count. Any help would be highly appreciated.

2 answers:

Answer 0 (score: 0)

from itertools import combinations

# skills, count_model and skills_occurrences are not defined in this snippet;
# they are assumed to exist already.
weights = []

# Diagonal entries: each skill paired with itself
for skill_id in skills.skill_id:
    if str(skill_id) in count_model.vocabulary_:
        i = count_model.vocabulary_[str(skill_id)]
        j = count_model.vocabulary_[str(skill_id)]
        if skills_occurrences[i][j] > 0:
            weights.append([skill_id, skill_id, skills_occurrences[i][j]])

# Off-diagonal entries: every unordered pair of distinct skills
for combination in combinations(skills.skill_id, 2):
    if str(combination[0]) in count_model.vocabulary_ and str(combination[1]) in count_model.vocabulary_:
        i = count_model.vocabulary_[str(combination[0])]
        j = count_model.vocabulary_[str(combination[1])]
        if skills_occurrences[i][j] > 0:
            weights.append([str(combination[0]), str(combination[1]), skills_occurrences[i][j]])

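There is one more dataset to deal with; after that, it is just a nested loop over pairs of skills that looks each pair up and keeps appending the skill IDs together with the value at their indices.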
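The nested-loop idea described here can be tried on toy data without any external library. The sketch below is only an illustration: skill_ids, vocabulary, occurrences and to_edge_list are hypothetical stand-ins for the answer's skills, count_model.vocabulary_ and skills_occurrences, and the numbers are made up.

```python
from itertools import combinations

# Hypothetical stand-ins for the objects used in the answer:
# vocabulary maps a skill ID (as a string) to its row/column position,
# occurrences is the square co-occurrence matrix as nested lists.
skill_ids = [4044, 4092, 4651]
vocabulary = {"4044": 0, "4092": 1, "4651": 2}
occurrences = [
    [2, 0, 1],
    [0, 1, 3],
    [1, 3, 0],
]

def to_edge_list(skill_ids, vocabulary, occurrences):
    """Return [skillid1, skillid2, count] rows for every non-zero cell."""
    edges = []
    # Self-pairs (the diagonal) first, then every unordered pair of distinct skills.
    pairs = [(s, s) for s in skill_ids] + list(combinations(skill_ids, 2))
    for first, second in pairs:
        # Skip skills that have no position in the matrix.
        if str(first) in vocabulary and str(second) in vocabulary:
            count = occurrences[vocabulary[str(first)]][vocabulary[str(second)]]
            if count > 0:
                edges.append([first, second, count])
    return edges

edges = to_edge_list(skill_ids, vocabulary, occurrences)
print(edges)
```

Each row can then be passed to pandas.DataFrame with the three column names from the question, if a DataFrame is the desired end result.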

Answer 1 (score: 0)

Assuming your co-occurrence matrix is called df and looks like this:

      4044  4092  4651  6168  6229  6284  6295
4044     0     0     0     1     1     0     0
4092     0     0     1     0     0     0     0
4651     0     1     0     0     0     0     0
6168     1     0     0     0     1     0     0
6229     1     0     0     1     0     0     0
6284     0     0     0     0     0     0     1
6295     0     0     0     0     0     1     0

I suggest the following:

import itertools
import pandas as pd

# get all possible pairs of (skillid1, skillid2)
edges = list(itertools.combinations(df.columns, 2))

# find associated weights in the original df
edges_with_weights = [(node1, node2, df.loc[node1, node2]) for (node1, node2) in edges]

# put it all in a new dataframe
new_df = pd.DataFrame(edges_with_weights, columns=["skillid1", "skillid2", "count"])

You now have the new_df you wanted:

   skillid1  skillid2  count
0      4044      4092      0
1      4044      4651      0
2      4044      6168      1
3      4044      6229      1
4      4044      6284      0
5      4044      6295      0
6      4092      4651      1
7      4092      6168      0
...
...
...
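Since the question mentions a huge matrix, looking up every pair with df.loc inside a Python loop may become slow. One possible alternative, not part of the original answer, is to pull the upper triangle out with NumPy in one step. The sketch below assumes the matrix is symmetric and that pairs with a count of zero are not wanted in the edge list; the sample df is rebuilt from the matrix shown above, and edge_df is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the matrix shown in this answer.
ids = [4044, 4092, 4651, 6168, 6229, 6284, 6295]
values = [
    [0, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 0],
]
df = pd.DataFrame(values, index=ids, columns=ids)

# Row/column positions of the strict upper triangle (k=1 skips the diagonal),
# so each unordered pair of distinct skills appears exactly once.
rows, cols = np.triu_indices(len(df), k=1)

edge_df = pd.DataFrame({
    "skillid1": df.index.to_numpy()[rows],
    "skillid2": df.columns.to_numpy()[cols],
    "count": df.to_numpy()[rows, cols],
})

# Assumption: pairs that never co-occur are dropped from the edge list.
edge_df = edge_df[edge_df["count"] > 0].reset_index(drop=True)
print(edge_df)
```

The pairs come out in the same order as itertools.combinations(df.columns, 2) would produce them. If zero-count pairs should be kept, as in new_df above, the filtering line can simply be omitted.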