Question

Given a typical pandas DataFrame holding "relational data":

|--------------|------------|------------|
|   Column1    |  Column2   |  Column3   |
|--------------|------------|------------|
|    A         |      1     |     C      |
|--------------|------------|------------|
|    B         |      2     |     C      |
|--------------|------------|------------|
|    A         |      2     |     C      |
|--------------|------------|------------|
|    A         |      1     |     C      |
|--------------|------------|------------|
|    ...       |    ...     |    ...     |
|--------------|------------|------------|

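I am trying to compute the probability of every length-2 combination of column values, i.e. tuples such as (A,1) --> 0.66, (A,2) --> 0.33, (B,2) --> 1, (2,B) --> 0.5, and so on. For example, (A,1) --> 0.66 because 2 of the 3 sample rows containing A also contain 1.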

I would like the result to be a list like the following:

[
   [A,1,0.66],
   [A,2,0.33],
   [B,2,1],
   [2,B,0.5],
   ...
]
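To make the target numbers concrete, here is a minimal sketch that reproduces them on the four sample rows. It assumes the value I want is the share of rows with the first value that also contain the second, which is what the examples suggest. The helper name `pair_probabilities` is made up for illustration.

```python
from itertools import permutations

import pandas as pd

# The four sample rows from the table above.
df = pd.DataFrame({
    "Column1": ["A", "B", "A", "A"],
    "Column2": [1, 2, 2, 1],
    "Column3": ["C", "C", "C", "C"],
})


def pair_probabilities(df, first, second):
    """Hypothetical helper: for each (first, second) value pair, return the
    fraction of rows with that `first` value that also have that `second` value."""
    # Number of rows for each observed (first value, second value) pair.
    pair_counts = df.groupby([first, second]).size()
    # Number of rows for each first value, broadcast back onto every pair.
    totals = pair_counts.groupby(level=0).transform("sum")
    probs = pair_counts / totals
    return [[a, b, p] for (a, b), p in probs.items()]


# Collect the triples for every ordered pair of distinct columns.
result = []
for first, second in permutations(df.columns, 2):
    result.extend(pair_probabilities(df, first, second))
```

On the sample rows this yields 2/3 for (A,1), 1/3 for (A,2), 1 for (B,2) and 0.5 for (2,B), matching the values listed above.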

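My current approach is really inefficient (even when using multiprocessing). Simplified, I iterate over all possibilities in plain Python, without Cython: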

# assumed setup: df is the DataFrame shown above,
# colnames holds all of its column labels
colnames = list(df.columns)
result = []

# iterate over every column
for colname in colnames:
    # pair it with every other column except the one under assessment
    for other in (c for c in colnames if c != colname):
        # groupby gives the count of each (value, value) pair
        groups = df.groupby([colname, other]).size().reset_index(name='counts')
        for index, row in groups.iterrows():
            # divide the pair count by the number of non-null entries
            # in the other column and push it onto the result list
            result.append([row[colname], row[other], row["counts"] / df[other].count()])

What is the most efficient way to perform this transformation?

Most efficient way to convert a pandas DataFrame with relational data into probability links

0 answers: