Question

I have a pandas DataFrame like this:

doc type    thing
3   A   pig
4   B   horse
4   C   cat
4   D   pig
5   C   horse
5   A   bird
5   B   cat

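I want a new DataFrame with three columns (thing, thing, times), filled with every pair of "thing" values that appear together in the same "doc", along with the number of docs in which each pair appears. Based on the DataFrame above, the desired output would be: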

thing   thing   times
horse   cat     2
horse   pig     1
cat     pig     1
horse   bird    1
bird    cat     1

I have made some progress along these lines with itertools outside of pandas, but how can this be done with pandas?
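For comparison, a plain-Python baseline in the spirit of the itertools approach mentioned above might look like the sketch below. The asker's actual code is not shown, so the names `rows`, `things_by_doc`, and `pair_counts` are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# (doc, thing) rows copied from the sample DataFrame in the question.
rows = [(3, "pig"), (4, "horse"), (4, "cat"), (4, "pig"),
        (5, "horse"), (5, "bird"), (5, "cat")]

# Collect the distinct things seen in each doc.
things_by_doc = {}
for doc, thing in rows:
    things_by_doc.setdefault(doc, set()).add(thing)

# For every unordered pair, count the docs that contain both things.
pair_counts = Counter()
for things in things_by_doc.values():
    pair_counts.update(combinations(sorted(things), 2))
```

Sorting each doc's things before calling `combinations` ensures that a pair such as (cat, horse) is always counted under the same key, whatever order the rows arrive in.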

Answer 1

A possible solution:

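import pandas as pd

df_filtered = df[['doc', 'thing']]
result = (
    pd.merge(df_filtered, df_filtered, on='doc')  # pair up rows that share a doc
    .query("thing_x < thing_y")                   # keep each pair once, drop self-pairs
    .groupby(by=['thing_x', 'thing_y'])
    .agg({'doc': 'nunique'})                      # distinct docs per pair
    .reset_index()
)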

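First, pandas.merge() produces the Cartesian product of all rows that share the same doc. The .query() step then drops the duplicate entries that appear in reversed order, as well as the entries where thing_x == thing_y. This gives you a table like this: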

    doc thing_x thing_y
5   4   horse   pig
6   4   cat     horse
8   4   cat     horse
10  4   cat     pig
15  4   horse   pig
16  4   cat     horse
18  4   cat     horse
20  4   cat     pig
29  5   bird    horse
31  5   bird    cat
32  5   cat     horse
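To see what each of these two stages contributes, the sketch below runs them separately on the sample data from the question. The intermediate names `pairs` and `unique_pairs` are introduced here for illustration only, and the row labels you get will depend on your own input.

```python
import pandas as pd

# Sample data from the question (the unused 'type' column is omitted).
df = pd.DataFrame({
    "doc":   [3, 4, 4, 4, 5, 5, 5],
    "thing": ["pig", "horse", "cat", "pig", "horse", "bird", "cat"],
})

# Self-merge: every row is paired with every row that has the same doc.
# The overlapping 'thing' column gets the default suffixes _x and _y.
pairs = pd.merge(df, df, on="doc")

# Keep one ordering of each pair and drop rows where thing_x == thing_y.
unique_pairs = pairs.query("thing_x < thing_y")

print(len(pairs), len(unique_pairs))
```

The string comparison `thing_x < thing_y` is what makes the filter do both jobs at once: it is false for identical values and true for exactly one of the two orderings of any distinct pair.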

Then .groupby() the two thing columns (thing_x and thing_y), count the number of distinct docs in each group, and call .reset_index() to flatten the hierarchical index.

Final result:

    thing_x thing_y doc
0   bird    cat     1
1   bird    horse   1
2   cat     horse   2
3   cat     pig     1
4   horse   pig     1
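As a quick check, here is a self-contained sketch that rebuilds the sample DataFrame from the question and runs the whole pipeline. It is a small variation rather than the answer's exact code: it selects the doc column and uses `reset_index(name="times")` so the count column carries the name the question asked for.

```python
import pandas as pd

# Sample DataFrame from the question.
df = pd.DataFrame({
    "doc":   [3, 4, 4, 4, 5, 5, 5],
    "type":  ["A", "B", "C", "D", "C", "A", "B"],
    "thing": ["pig", "horse", "cat", "pig", "horse", "bird", "cat"],
})

df_filtered = df[["doc", "thing"]]
result = (
    pd.merge(df_filtered, df_filtered, on="doc")  # pair up rows that share a doc
    .query("thing_x < thing_y")                   # one ordering per pair, no self-pairs
    .groupby(["thing_x", "thing_y"])["doc"]
    .nunique()                                    # distinct docs per pair
    .reset_index(name="times")                    # flatten and name the count column
)

print(result)
```

Counting with `nunique` rather than `size` matters when a doc lists the same thing more than once: duplicated rows then still count as a single doc for that pair.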
