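The following example from the web, which implements an expand_grid() function, uses three variables: height (2 categories), weight (3 categories), and sex (2 categories), for a total of 2 * 3 * 2 = 12 combinations.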
df={'height': [60, 70],
'weight': [100, 140, 180],
'sex': ['Male', 'Female']}
Running expand_grid on the object above
expand_grid(df)
produces the following result:
sex weight height
0 Male 100 60
1 Male 100 70
2 Male 140 60
3 Male 140 70
4 Male 180 60
5 Male 180 70
6 Female 100 60
7 Female 100 70
8 Female 140 60
9 Female 140 70
10 Female 180 60
11 Female 180 70
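The post does not show how expand_grid() is defined. A minimal sketch, assuming it takes a dict that maps column names to lists of values and returns every combination as a pandas DataFrame, might look like the following. Reversing the key order is an assumption made only so that the column order and row order match the output above.

```python
import itertools
import pandas as pd

def expand_grid(data):
    """Return every combination of the value lists in `data` as a DataFrame."""
    # Reverse the key order so the last key ('sex') varies slowest,
    # which reproduces the column and row order of the output above.
    names = list(data)[::-1]
    rows = list(itertools.product(*(data[name] for name in names)))
    return pd.DataFrame(rows, columns=names)

df = {'height': [60, 70],
      'weight': [100, 140, 180],
      'sex': ['Male', 'Female']}
grid = expand_grid(df)
print(grid.shape)  # (12, 3)
```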
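I want to do the same thing for a dataset with the following columns (number of categories in parentheses):
Race (9), marital status (3), sex (2), age (2), Hispanic (2).
That is 9 * 3 * 2 * 2 * 2 = 216 combinations.
I would like the following: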
Race Marital_Status Sex Age Hispanic
0 White Married Male Under_18 Hispanic
1 White Married Male Under_18 Non-Hispanic
2 White Married Male Over_18 Hispanic
3 White Married Male Over_18 Non-Hispanic
4 White Married Female Under_18 Hispanic
5 White Married Female Under_18 Non-Hispanic
.
.
.
215 Asian Single Female Over_18 Non-Hispanic
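The number of rows in such a grid is the product of the number of levels per column, so it can be checked before anything is built. A small sketch using the counts stated above (the dict name `levels` is only illustrative):

```python
import math

# Number of categories per column, as listed in the question.
levels = {'Race': 9, 'Marital_Status': 3, 'Sex': 2, 'Age': 2, 'Hispanic': 2}

# The grid size is the product of the per-column level counts.
n_rows = math.prod(levels.values())
print(n_rows)  # 216
```

With zero-based row labels, a 216-row grid is numbered 0 through 215.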
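When I try to run expand_grid(), the system runs out of memory.
I was told that it would be faster and computationally cheaper if there were a way to let Python know the data types in advance (e.g. lists, vectors). Is there a feasible way to do this?
Thank you very much!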
Answer 0 (score: 0)
The itertools module from the Python standard library can do the job.
import itertools
import pandas as pd
cat = {
'C1': ['A', 'B', 'C'],
'C2': ['A', 'B'],
'C3': ['A', 'B', 'C', 'D']
}
order = list(cat)  # column names, in the dict's insertion order
pd.DataFrame(list(itertools.product(*[cat[k] for k in order])), columns=order)
This creates a DataFrame containing every possible combination of the category levels (the Cartesian product):
C1 C2 C3
0 A A A
1 A A B
2 A A C
[...]
22 C B C
23 C B D
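The same approach carries over to the five-column case in the question, provided each column is reduced to its distinct levels first. A grid of a few hundred rows is small, so running out of memory may indicate that whole data columns, rather than the category lists, were passed in; that is a guess, since the question does not show the input. The sketch below uses placeholder level lists: the question names only a few labels, so Race and Marital_Status are shortened here and do not represent the real 9 and 3 levels. It also shows that itertools.product is lazy, so rows can be consumed one at a time without holding the full grid in memory.

```python
import itertools

# Placeholder levels: illustrative only, not the asker's full category lists.
cat = {
    'Race': ['White', 'Asian'],
    'Marital_Status': ['Married', 'Single'],
    'Sex': ['Male', 'Female'],
    'Age': ['Under_18', 'Over_18'],
    'Hispanic': ['Hispanic', 'Non-Hispanic'],
}
order = list(cat)

# product() yields one tuple at a time; nothing is stored until requested.
combos = itertools.product(*(cat[k] for k in order))
first_rows = list(itertools.islice(combos, 3))
for row in first_rows:
    print(dict(zip(order, row)))

# Count the remaining rows without keeping them in memory.
remaining = sum(1 for _ in combos)
print(len(first_rows) + remaining)  # 32 = 2 * 2 * 2 * 2 * 2
```

No type declaration is needed for this: itertools.product accepts any iterables, and plain lists of distinct levels are enough. When the table form is needed, a fresh product can be wrapped in list() and passed to pd.DataFrame with columns=order, as in the answer above.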