Python pandas explode (one-to-many relationship)

Time: 2020-02-14 20:24:55

Tags: python python-3.x pandas dataframe pandas-groupby

Suppose I have the following dataframe with the columns name, preference, and fruits:

name    preference  fruits
adam    likes       apples
mike    dislikes    orange

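Suppose the data has a one-to-many relationship, so that each value in the name column is related to every value in the preference and fruits columns. The output dataframe I'm looking for is: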

name    preference  fruits
adam    likes       apples
adam    likes       orange
adam    dislikes    apples
adam    dislikes    orange
mike    likes       apples
mike    likes       orange
mike    dislikes    apples
mike    dislikes    orange

I'd like to know whether this is possible. From what I know about pandas so far, I believe I would have to use groupby? Any help is appreciated. Thanks!

2 answers:

Answer 0: (score: 2)

Isn't this just a cross product?

(pd.MultiIndex.from_product([df[col] for col in df],
                           names=df.columns)
   .to_frame().reset_index(drop=True)
)

Output:

   name preference  fruits
0  adam      likes  apples
1  adam      likes  orange
2  adam   dislikes  apples
3  adam   dislikes  orange
4  mike      likes  apples
5  mike      likes  orange
6  mike   dislikes  apples
7  mike   dislikes  orange
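The same cross product can also be sketched with a chain of cross joins. This is an alternative not used in the answer above, and it assumes pandas 1.2 or later, where `DataFrame.merge` accepts `how="cross"`. The names `parts` and `crossed` are illustrative only.

```python
import pandas as pd
from functools import reduce

# Same input as in the question
df = pd.DataFrame({
    "name": ["adam", "mike"],
    "preference": ["likes", "dislikes"],
    "fruits": ["apples", "orange"],
})

# Split the frame into one single-column frame per column
parts = [df[[col]] for col in df.columns]

# Cross-join the pieces left to right.
# Assumption: pandas >= 1.2, which introduced how="cross".
crossed = reduce(lambda left, right: left.merge(right, how="cross"), parts)

print(crossed)
```

Each cross join pairs every row on the left with every row on the right, so three columns of two values each yield 2 * 2 * 2 = 8 rows.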

Answer 1: (score: 0)

I would use itertools.product

import pandas as pd
from itertools import product


df = pd.DataFrame({
    'name': ['adam', 'mike'],
    'preference': ['likes', 'dislikes'],
    'fruits': ['apples', 'oranges']
})

ndf = pd.DataFrame(
    product(*[df[c] for c in df.columns]),
    columns=df.columns
)

print(ndf)
#    name preference   fruits
# 0  adam      likes   apples
# 1  adam      likes  oranges
# 2  adam   dislikes   apples
# 3  adam   dislikes  oranges
# 4  mike      likes   apples
# 5  mike      likes  oranges
# 6  mike   dislikes   apples
# 7  mike   dislikes  oranges
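One case the example above does not cover is a column that contains repeated values: the plain product would then repeat rows as well. A minimal sketch of one way to handle this, assuming distinct combinations are what is wanted, is to take each column's unique values first. The input below is hypothetical and not from the question.

```python
import pandas as pd
from itertools import product

# Hypothetical input in which values repeat within each column
df = pd.DataFrame({
    "name": ["adam", "adam", "mike"],
    "preference": ["likes", "dislikes", "likes"],
    "fruits": ["apples", "oranges", "apples"],
})

# Take each column's distinct values (in order of first appearance)
# before building the product, so repeated inputs do not repeat rows
unique_values = [df[c].unique() for c in df.columns]

ndf = pd.DataFrame(list(product(*unique_values)), columns=df.columns)

print(len(ndf))  # 2 names * 2 preferences * 2 fruits = 8
```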

As for speed, this also appears to be somewhat faster on this small dataframe.

%%timeit
ndf = pd.DataFrame(
    product(*[df[c] for c in df.columns]),
    columns=df.columns
)
# 624 µs ± 32.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


%%timeit
(pd.MultiIndex.from_product([df[col] for col in df],
                           names=df.columns)
   .to_frame().reset_index(drop=True)
)
# 3.51 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
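`%%timeit` is an IPython cell magic, so the timings above only run in IPython or Jupyter. The sketch below shows one way to repeat the comparison in a plain Python script with the standard `timeit` module. The function names `with_itertools` and `with_multiindex` and the repetition count are illustrative choices. Absolute numbers will vary by machine and pandas version, and the ranking on a two-row frame may not hold for larger inputs.

```python
import timeit
from itertools import product

import pandas as pd

df = pd.DataFrame({
    "name": ["adam", "mike"],
    "preference": ["likes", "dislikes"],
    "fruits": ["apples", "oranges"],
})


def with_itertools(frame):
    # Cartesian product of the column values via itertools
    return pd.DataFrame(
        list(product(*[frame[c] for c in frame.columns])),
        columns=frame.columns,
    )


def with_multiindex(frame):
    # Cartesian product via a MultiIndex, converted back to a frame
    return (
        pd.MultiIndex.from_product(
            [frame[c] for c in frame.columns], names=frame.columns
        )
        .to_frame()
        .reset_index(drop=True)
    )


runs = 200  # arbitrary repetition count for this sketch
for fn in (with_itertools, with_multiindex):
    seconds = timeit.timeit(lambda: fn(df), number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1e6:.0f} µs per call")
```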