I have a DataFrame with the following structure:
>>> df
   ID    Class           Type
0   1     Math       Calculus
1   1     Math        Algebra
2   1  Science        Physics
3   1  History       American
4   2     Math  Factorization
5   2  History       European
6   2  Science      Chemistry
7   2  Science        Biology
8   3     Math    Computation
9   3  Science        Biology
The desired output is a structure that maps each ID to its Classes, and each Class to its list of Types.
For example:
{
1: {Math: [Calculus, Algebra], Science: [Physics], History: [American]}
2: {Math: [Factorization], History: [European], Science: [Chemistry, Biology]}
3: {Math: [Computation], Science: [Biology]}
}
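I could do this with a for loop, but the dataset is very large (around 30 million rows), so I would like to do it with Pandas.
I was able to get correctly formatted output for a single ID: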
>>> df.groupby(['ID', 'Class'])['Type'].apply(lambda x: x.to_dict())[1].groupby('Class').apply(lambda x: x.to_list()).to_dict()
{'History': ['American'], 'Math': ['Calculus', 'Algebra'], 'Science': ['Physics']}
>>> df.groupby(['ID', 'Class'])['Type'].apply(lambda x: x.to_dict())[2].groupby('Class').apply(lambda x: x.to_list()).to_dict()
{'History': ['European'], 'Math': ['Factorization'], 'Science': ['Chemistry', 'Biology']}
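How can I apply the logic above to all IDs, and is there a simpler way? I think I'm nesting too many groupbys and overcomplicating the problem, but I'm not sure how to do this more efficiently.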
Answer 0 (score: 1)
IIUC, you can try something like this:
import pandas as pd
txt="""0 1 Math Calculus
1 1 Math Algebra
2 1 Science Physics
3 1 History American
4 2 Math Factorization
5 2 History European
6 2 Science Chemistry
7 2 Science Biology
8 3 Math Computation
9 3 Science Biology"""
# Split each line on whitespace and drop the leading row index
rows = [line.split()[1:] for line in txt.split("\n")]
df = pd.DataFrame(rows, columns=["ID", "Class", "Type"])
df["ID"] = df["ID"].astype(int)

# For each ID, group its rows by Class and collect the Type values into a list
out = (
    df.groupby("ID")
      .apply(lambda x: x.groupby("Class")
                        .apply(lambda y: y["Type"].tolist())
                        .to_dict())
)
which returns
ID
1 {'History': ['American'], 'Math': ['Calculus',...
2 {'History': ['European'], 'Math': ['Factorization',...
3 {'Math': ['Computation'], 'Science': ['Biology']}
dtype: object
Now you can access an entry with out[1]["Math"]
(which returns ['Calculus', 'Algebra']).
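If a plain nested dict is wanted, as in the question, one possible alternative is a single groupby on both keys followed by one pass over the grouped result. This is only a sketch using the sample data from the question; the names grouped and nested are introduced here for illustration, and it has not been benchmarked on 30 million rows.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    "Class": ["Math", "Math", "Science", "History", "Math",
              "History", "Science", "Science", "Math", "Science"],
    "Type": ["Calculus", "Algebra", "Physics", "American", "Factorization",
             "European", "Chemistry", "Biology", "Computation", "Biology"],
})

# One groupby over both keys collects the Type values into lists,
# giving a Series indexed by (ID, Class) pairs.
grouped = df.groupby(["ID", "Class"])["Type"].agg(list)

# Fold the (ID, Class) index into a nested dict: ID -> Class -> list of Types.
nested = {}
for (id_, class_), types in grouped.items():
    nested.setdefault(id_, {})[class_] = types

print(nested[1]["Math"])
```

The Python loop here runs once per (ID, Class) pair rather than once per row, so the row-level work stays inside pandas.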