Suppose I have the following DataFrame with the columns name, preference, and fruits:
name preference fruits
adam likes apples
mike dislikes orange
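I want to treat this as a one-to-many relationship: each value in the name column should be paired with every combination of values in the preference and fruits columns. The output DataFrame I am looking for is: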
name preference fruits
adam likes apples
adam likes orange
adam dislikes apples
adam dislikes orange
mike likes apples
mike likes orange
mike dislikes apples
mike dislikes orange
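Is this possible? From what I know of pandas so far, I assume I would have to use groupby. Any help is appreciated. Thanks!
Answer 0 (score: 2)
Isn't this just a cross product (Cartesian product) of the columns?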
(pd.MultiIndex.from_product([df[col] for col in df],
                            names=df.columns)
   .to_frame().reset_index(drop=True)
)
Output:
name preference fruits
0 adam likes apples
1 adam likes orange
2 adam dislikes apples
3 adam dislikes orange
4 mike likes apples
5 mike likes orange
6 mike dislikes apples
7 mike dislikes orange
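The snippet in this answer assumes a DataFrame named df already exists. A minimal self-contained sketch, assuming df is built from the sample data in the question (the variable name result is only for illustration):

```python
import pandas as pd

# Sample data from the question (assumed to be how df was built).
df = pd.DataFrame({
    "name": ["adam", "mike"],
    "preference": ["likes", "dislikes"],
    "fruits": ["apples", "orange"],
})

# Build every combination of the column values as a MultiIndex,
# then convert the index levels back into ordinary columns.
result = (
    pd.MultiIndex.from_product([df[col] for col in df], names=df.columns)
    .to_frame()
    .reset_index(drop=True)
)
print(result)
```

With two values per column this gives 2 x 2 x 2 = 8 rows; the row count grows as the product of the column lengths, so it can get large quickly on bigger inputs.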
Answer 1 (score: 0)
import pandas as pd
from itertools import product
df = pd.DataFrame({
    'name': ['adam', 'mike'],
    'preference': ['likes', 'dislikes'],
    'fruits': ['apples', 'oranges']
})
ndf = pd.DataFrame(
    product(*[df[c] for c in df.columns]),
    columns=df.columns
)
print(ndf)
# name preference fruits
# 0 adam likes apples
# 1 adam likes oranges
# 2 adam dislikes apples
# 3 adam dislikes oranges
# 4 mike likes apples
# 5 mike likes oranges
# 6 mike dislikes apples
# 7 mike dislikes oranges
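In terms of speed, this also appears to be faster on this small example (about 0.6 ms versus about 3.5 ms per loop):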
%%timeit
ndf = pd.DataFrame(
    product(*[df[c] for c in df.columns]),
    columns=df.columns
)
# 624 µs ± 32.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
(pd.MultiIndex.from_product([df[col] for col in df],
                            names=df.columns)
   .to_frame().reset_index(drop=True)
)
# 3.51 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
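The %%timeit cells above only run inside IPython or Jupyter. For anyone who wants to repeat the comparison in a plain script, here is a rough sketch using the standard-library timeit module. The function names with_itertools and with_multiindex and the repeat count n are my own choices for illustration, not part of either answer.

```python
import timeit
from itertools import product

import pandas as pd

df = pd.DataFrame({
    "name": ["adam", "mike"],
    "preference": ["likes", "dislikes"],
    "fruits": ["apples", "oranges"],
})


def with_itertools(frame):
    # Answer 1: itertools.product over the columns.
    return pd.DataFrame(
        product(*[frame[c] for c in frame.columns]),
        columns=frame.columns,
    )


def with_multiindex(frame):
    # Answer 0: MultiIndex.from_product, converted back to columns.
    return (
        pd.MultiIndex.from_product([frame[c] for c in frame], names=frame.columns)
        .to_frame()
        .reset_index(drop=True)
    )


# Sanity check: both approaches should produce the same rows in the same order.
same_rows = with_itertools(df).values.tolist() == with_multiindex(df).values.tolist()
print("same rows:", same_rows)

# Average wall-clock time per call over n runs (n is an arbitrary choice).
n = 100
t_itertools = timeit.timeit(lambda: with_itertools(df), number=n) / n
t_multiindex = timeit.timeit(lambda: with_multiindex(df), number=n) / n
print(f"itertools.product:       {t_itertools * 1e6:.0f} µs per call")
print(f"MultiIndex.from_product: {t_multiindex * 1e6:.0f} µs per call")
```

The absolute numbers depend on the machine and the pandas version, and on a two-row frame the measurement is dominated by fixed overhead, so it is worth rerunning with data of realistic size before drawing conclusions.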