I have the UCI Adult dataset, and I need to count the number of samples whose features all fall within a given set of ranges. I tried `multiprocessing.Pool` with `Pool.starmap` and the code below, but 23,040,000 combinations on 8 worker processes did not finish within 2 hours. I need the script to run much faster, because it is part of a pipeline that requires parameter tuning.
dataset[(combination[0][0] <= dataset["age"]) &
(dataset["age"] <= combination[0][1]) &
(combination[1][0] <= dataset["work_class"]) &
(dataset["work_class"] <= combination[1][1]) &
(combination[2][0] <= dataset["fnlwgt"]) &
(dataset["fnlwgt"] <= combination[2][1]) &
(combination[3][0] <= dataset["education_num"]) &
(dataset["education_num"] <= combination[3][1]) &
(combination[4][0] <= dataset["martial_status"]) &
(dataset["martial_status"] <= combination[4][1]) &
(combination[5][0] <= dataset["occupation"]) &
(dataset["occupation"] <= combination[5][1]) &
(combination[6][0] <= dataset["relationship"]) &
(dataset["relationship"] <= combination[6][1]) &
(combination[7][0] <= dataset["race"]) &
(dataset["race"] <= combination[7][1]) &
(combination[8][0] <= dataset["sex"]) &
(dataset["sex"] <= combination[8][1]) &
(combination[9][0] <= dataset["capital_gain"]) &
(dataset["capital_gain"] <= combination[9][1]) &
(combination[10][0] <= dataset["capital_loss"]) &
(dataset["capital_loss"] <= combination[10][1]) &
(combination[11][0] <= dataset["hours_per_week"]) &
(dataset["hours_per_week"] <= combination[11][1]) &
(combination[12][0] <= dataset["native_country"]) &
(dataset["native_country"] <= combination[12][1]) &
(combination[13][0] <= dataset["money_earned"]) &
(dataset["money_earned"] <= combination[13][1])
].shape[0]
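For context, the filter above is evaluated once per combination; a minimal self-contained sketch of how that counting step can be wrapped into a function (the function name and column list are illustrative, matching the columns used in the filter):

```python
import pandas as pd

# Column order matches the index order of `combination` in the filter above.
COLUMNS = ["age", "work_class", "fnlwgt", "education_num", "martial_status",
           "occupation", "relationship", "race", "sex", "capital_gain",
           "capital_loss", "hours_per_week", "native_country", "money_earned"]

def count_in_ranges(dataset, combination):
    # AND together one boolean mask per (low, high) range, then count rows.
    mask = pd.Series(True, index=dataset.index)
    for col, (low, high) in zip(COLUMNS, combination):
        mask &= (low <= dataset[col]) & (dataset[col] <= high)
    return int(mask.sum())
```

This is equivalent to the chained `&` expression above, just written as a loop over `(column, range)` pairs.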
I have 115,200,000 combinations in total, and the dataset (training + test) has about 45,000 rows.
For the dataset description, see https://archive.ics.uci.edu/ml/datasets/adult
To be able to use `<=`, I converted the string values to numeric codes with pandas' `DataFrame.replace`, using the following dictionaries.
work_class_conversion_dict = {
"Private": 0,
"Self-emp-not-inc": 1,
"Self-emp-inc": 2,
"Federal-gov": 3,
"Local-gov": 4,
"State-gov": 5,
"Without-pay": 6,
"Never-worked": 7
}
martial_status_conversion_dict = {
"Married-civ-spouse": 0,
"Married-spouse-absent": 1,
"Married-AF-spouse": 2,
"Divorced": 3,
"Never-married": 4,
"Separated": 5,
"Widowed": 6
}
occupation_conversion_dict = {
"Tech-support": 0,
"Craft-repair": 1,
"Sales": 2,
"Exec-managerial": 3,
"Prof-specialty": 4,
"Handlers-cleaners": 5,
"Machine-op-inspct": 6,
"Adm-clerical": 7,
"Farming-fishing": 8,
"Transport-moving": 9,
"Priv-house-serv": 10,
"Protective-serv": 11,
"Armed-Forces": 12,
"Other-service": 13
}
relationship_conversion_dict = {
"Wife": 0,
"Husband": 1,
"Not-in-family": 2,
"Own-child": 3,
"Other-relative": 4,
"Unmarried": 5
}
race_conversion_dict = {
"White": 0,
"Asian-Pac-Islander": 1,
"Amer-Indian-Eskimo": 2,
"Other": 3,
"Black": 4
}
sex_conversion_dict = {
"Female": 0,
"Male": 1
}
native_country_conversion_dict = {
"United-States": 0,
"China": 1,
"Japan": 2,
"Germany": 3,
"France": 4,
"India": 5,
"Italy": 6,
"Canada": 7,
"South": 8,
"Mexico": 9,
"Holand-Netherlands": 10,
"Taiwan": 11,
"Poland": 12,
"Thailand": 13,
"Iran": 14,
"Hong": 15,
"Philippines": 16,
"Ireland": 17,
"Columbia": 18,
"Portugal": 19,
"Vietnam": 20,
"Peru": 21,
"Greece": 22,
"Hungary": 23,
"Puerto-Rico": 24,
"Ecuador": 25,
"Dominican-Republic": 26,
"Guatemala": 27,
"El-Salvador": 28,
"Honduras": 29,
"Trinadad&Tobago": 30,
"Cambodia": 31,
"Jamaica": 32,
"Laos": 33,
"Nicaragua": 34,
"Haiti": 35,
"England": 36,
"Outlying-US(Guam-USVI-etc)": 37,
"Cuba": 38,
"Scotland": 39,
"Yugoslavia": 40
}
money_earned_conversion_dict = {
"<=50K": 0,
">50K": 1
}
conversion_dict = {
"work_class": work_class_conversion_dict,
"martial_status": martial_status_conversion_dict,
"relationship": relationship_conversion_dict,
"occupation": occupation_conversion_dict,
"race": race_conversion_dict,
"sex": sex_conversion_dict,
"native_country": native_country_conversion_dict,
"money_earned": money_earned_conversion_dict
}
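The conversion step itself is then a single `DataFrame.replace` call with the nested `conversion_dict`. A minimal sketch using only the `sex` column (the `raw` DataFrame here is illustrative):

```python
import pandas as pd

# Same nested shape as conversion_dict above: {column: {string: code}}.
demo_conversion = {"sex": {"Female": 0, "Male": 1}}

raw = pd.DataFrame({"sex": ["Female", "Male", "Male"]})  # illustrative data
encoded = raw.replace(demo_conversion)
print(encoded["sex"].tolist())  # [0, 1, 1]
```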
The `combination` element you see in the first code snippet is a single element of the `itertools.product` below (this assumes `from itertools import product` and `import numpy as np`).
age_bins = [(15, 25), (26, 35), (36, 45), (46, 55), (55, np.inf)]
work_class_bins = [(0, 0), (1, 2), (3, 5), (6, 7)]
education_num_bins = [(1, 4), (5, 8), (9, 12), (13, 16)]
martial_status_bins = [(0, 2), (3, 6)]
occupation_bins = [(0, 1), (2, 4), (5, 7), (8, 9), (10, 13)]
relationship_bins = [(0, 1), (2, 2), (3, 5)]
race_bins = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
sex_bins = [(0, 0), (1, 1)]
capital_gain_bins = [(0, 0), (0, 1000), (1001, 5000), (5001, np.inf)]
capital_loss_bins = [(0, 0), (1, 1000), (1001, np.inf)]
hours_per_week_bins = [(0, 20), (21, 40), (41, 60), (60, np.inf)]
native_country_bins = [(0, 1), (2, 9), (10, 24), (25, 35), (36, 40)]
money_earned_bins = [(0, 0), (1, 1)]
fnlwgt_bins = []
fnlwgt_max = dataset["fnlwgt"].max()
fnlwgt_min = dataset["fnlwgt"].min()
# Ten equal-width bins spanning [fnlwgt_min, fnlwgt_max]; the original code
# omitted the fnlwgt_min offset, so the bins did not start at the minimum.
step = (fnlwgt_max - fnlwgt_min) / 10
for i in range(1, 11):
    fnlwgt_bins.append((fnlwgt_min + (i - 1) * step, fnlwgt_min + i * step))
combinations = product(
age_bins,
work_class_bins,
fnlwgt_bins,
education_num_bins,
martial_status_bins,
occupation_bins,
relationship_bins,
race_bins,
sex_bins,
capital_gain_bins,
capital_loss_bins,
hours_per_week_bins,
native_country_bins,
money_earned_bins
)
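As a quick sanity check, multiplying the lengths of the bin lists above reproduces the total stated earlier:

```python
import math

# Lengths of: age, work_class, fnlwgt, education_num, martial_status,
# occupation, relationship, race, sex, capital_gain, capital_loss,
# hours_per_week, native_country, money_earned bin lists.
bin_counts = [5, 4, 10, 4, 2, 5, 3, 5, 2, 4, 3, 4, 5, 2]
total = math.prod(bin_counts)
print(total)  # 115200000
```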
A sample of the processed dataset can be found here: https://pastebin.com/tSH4UYXc.
The filtered counts will later be used in a differential privacy algorithm.
Do you have any ideas on how to speed this up?
Do you think it would be faster if I made full use of NumPy?
Should I try numpy.vectorize?
Please explain the solution, as I am new to pandas and would like to understand how to write more efficient code.
Cheers,
Dan