Speeding up Pandas column filtering

Date: 2017-07-17 16:27:27

Tags: python pandas numpy

I have the UCI Adult dataset, and I need to count the number of samples whose features all fall within a given set of ranges. I have tried multiprocessing.Pool with starmap and the code below, but 23,040,000 combinations did not finish within 2 hours on 8 threads. I need this script to run very fast, because it is part of a pipeline that requires parameter tuning.

dataset[(combination[0][0] <= dataset["age"]) &
        (dataset["age"] <= combination[0][1]) &
        (combination[1][0] <= dataset["work_class"]) &
        (dataset["work_class"] <= combination[1][1]) &
        (combination[2][0] <= dataset["fnlwgt"]) &
        (dataset["fnlwgt"] <= combination[2][1]) &
        (combination[3][0] <= dataset["education_num"]) &
        (dataset["education_num"] <= combination[3][1]) &
        (combination[4][0] <= dataset["martial_status"]) &
        (dataset["martial_status"] <= combination[4][1]) &
        (combination[5][0] <= dataset["occupation"]) &
        (dataset["occupation"] <= combination[5][1]) &
        (combination[6][0] <= dataset["relationship"]) &
        (dataset["relationship"] <= combination[6][1]) &
        (combination[7][0] <= dataset["race"]) &
        (dataset["race"] <= combination[7][1]) &
        (combination[8][0] <= dataset["sex"]) &
        (dataset["sex"] <= combination[8][1]) &
        (combination[9][0] <= dataset["capital_gain"]) &
        (dataset["capital_gain"] <= combination[9][1]) &
        (combination[10][0] <= dataset["capital_loss"]) &
        (dataset["capital_loss"] <= combination[10][1]) &
        (combination[11][0] <= dataset["hours_per_week"]) &
        (dataset["hours_per_week"] <= combination[11][1]) &
        (combination[12][0] <= dataset["native_country"]) &
        (dataset["native_country"] <= combination[12][1]) &
        (combination[13][0] <= dataset["money_earned"]) &
        (dataset["money_earned"] <= combination[13][1])
       ].shape[0]
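The fourteen repeated range tests above can also be written as a loop over (low, high) pairs and column names, which is easier to maintain. A sketch on toy data (`columns` stands in for my fourteen real column names, in the same order as the pairs in `combination`):

```python
import numpy as np
import pandas as pd

# toy stand-in: two columns instead of the full fourteen
dataset = pd.DataFrame({"age": [20, 30, 40], "hours_per_week": [10, 35, 50]})
combination = [(15, 25), (0, 20)]    # one (low, high) pair per column
columns = ["age", "hours_per_week"]  # same order as the pairs above

mask = np.ones(len(dataset), dtype=bool)
for (low, high), col in zip(combination, columns):
    values = dataset[col].to_numpy()
    mask &= (low <= values) & (values <= high)
count = int(mask.sum())  # number of rows inside every range
```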

I have 115,200,000 combinations, and the dataset (training + test) has about 45,000 entries.

For a sample of the dataset, see https://archive.ics.uci.edu/ml/datasets/adult

为了能够使用&lt; =我已经使用以下字典在Pandas中使用Dataframe.replace函数转换了数值中的字符串值。

work_class_conversion_dict = {
    "Private": 0,
    "Self-emp-not-inc": 1,
    "Self-emp-inc": 2,
    "Federal-gov": 3,
    "Local-gov": 4,
    "State-gov": 5,
    "Without-pay": 6,
    "Never-worked": 7
}

martial_status_conversion_dict = {
    "Married-civ-spouse": 0,
    "Married-spouse-absent": 1,
    "Married-AF-spouse": 2,
    "Divorced": 3,
    "Never-married": 4,
    "Separated": 5,
    "Widowed": 6
}

occupation_conversion_dict = {
    "Tech-support": 0,
    "Craft-repair": 1,
    "Sales": 2,
    "Exec-managerial": 3,
    "Prof-specialty": 4,
    "Handlers-cleaners": 5,
    "Machine-op-inspct": 6,
    "Adm-clerical": 7,
    "Farming-fishing": 8,
    "Transport-moving": 9,
    "Priv-house-serv": 10,
    "Protective-serv": 11,
    "Armed-Forces": 12,
    "Other-service": 13
}

relationship_conversion_dict = {
    "Wife": 0,
    "Husband": 1,
    "Not-in-family": 2,
    "Own-child": 3,
    "Other-relative": 4,
    "Unmarried": 5
}

race_conversion_dict = {
    "White": 0,
    "Asian-Pac-Islander": 1,
    "Amer-Indian-Eskimo": 2,
    "Other": 3,
    "Black": 4
}

sex_conversion_dict = {
    "Female": 0,
    "Male": 1
}

native_country_conversion_dict = {
    "United-States": 0,
    "China": 1,
    "Japan": 2,
    "Germany": 3,
    "France": 4,
    "India": 5,
    "Italy": 6,
    "Canada": 7,
    "South": 8,
    "Mexico": 9,
    "Holand-Netherlands": 10,
    "Taiwan": 11,
    "Poland": 12,
    "Thailand": 13,
    "Iran": 14,
    "Hong": 15,
    "Philippines": 16,
    "Ireland": 17,
    "Columbia": 18,
    "Portugal": 19,
    "Vietnam": 20,
    "Peru": 21,
    "Greece": 22,
    "Hungary": 23,
    "Puerto-Rico": 24,
    "Ecuador": 25,
    "Dominican-Republic": 26,
    "Guatemala": 27,
    "El-Salvador": 28,
    "Honduras": 29,
    "Trinadad&Tobago": 30,
    "Cambodia": 31,
    "Jamaica": 32,
    "Laos": 33,
    "Nicaragua": 34,
    "Haiti": 35,
    "England": 36,
    "Outlying-US(Guam-USVI-etc)": 37,
    "Cuba": 38,
    "Scotland": 39,
    "Yugoslavia": 40
}

money_earned_conversion_dict = {
    "<=50K": 0,
    ">50K": 1
}

conversion_dict = {
    "work_class": work_class_conversion_dict,
    "martial_status": martial_status_conversion_dict,
    "relationship": relationship_conversion_dict,
    "occupation": occupation_conversion_dict,
    "race": race_conversion_dict,
    "sex": sex_conversion_dict,
    "native_country": native_country_conversion_dict,
    "money_earned": money_earned_conversion_dict
}
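The replace step looks roughly like this (a minimal sketch on a tiny made-up frame; the real call uses all of the dictionaries above):

```python
import pandas as pd

# tiny stand-in for the Adult dataset
dataset = pd.DataFrame({"sex": ["Male", "Female"],
                        "race": ["White", "Black"]})
conversion_dict = {"sex": {"Female": 0, "Male": 1},
                   "race": {"White": 0, "Black": 4}}
# a nested {column: {old: new}} dict replaces per column
dataset = dataset.replace(conversion_dict)
```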

The combination element you see in the first code block is a single element of the following product.

age_bins = [(15, 25), (26, 35), (36, 45), (46, 55), (55, np.inf)]
work_class_bins = [(0, 0), (1, 2), (3, 5), (6, 7)]
education_num_bins = [(1, 4), (5, 8), (9, 12), (13, 16)]
martial_status_bins = [(0, 2), (3, 6)]
occupation_bins = [(0, 1), (2, 4), (5, 7), (8, 9), (10, 13)]
relationship_bins = [(0, 1), (2, 2), (3, 5)]
race_bins = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
sex_bins = [(0, 0), (1, 1)]
capital_gain_bins = [(0, 0), (0, 1000), (1001, 5000), (5001, np.inf)]
capital_loss_bins = [(0, 0), (1, 1000), (1001, np.inf)]
hours_per_week_bins = [(0, 20), (21, 40), (41, 60), (60, np.inf)]
native_country_bins = [(0, 1), (2, 9), (10, 24), (25, 35), (36, 40)]
money_earned_bins = [(0, 0), (1, 1)]
fnlwgt_bins = []

fnlwgt_max = dataset["fnlwgt"].max(axis=0)
fnlwgt_min = dataset["fnlwgt"].min(axis=0)
for i in range(1, 11):
    # note: without the fnlwgt_min offset the bins would start at 0
    # instead of the column's minimum and miss part of the data
    fnlwgt_bins.append((fnlwgt_min + (i - 1) * (fnlwgt_max - fnlwgt_min) / 10,
                        fnlwgt_min + i * (fnlwgt_max - fnlwgt_min) / 10))

combinations = product(
    age_bins,
    work_class_bins,
    fnlwgt_bins,
    education_num_bins,
    martial_status_bins,
    occupation_bins,
    relationship_bins,
    race_bins,
    sex_bins,
    capital_gain_bins,
    capital_loss_bins,
    hours_per_week_bins,
    native_country_bins,
    money_earned_bins
)
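As a sanity check, the total number of combinations is the product of the bin-list lengths, which is where the 115,200,000 figure comes from:

```python
from functools import reduce

# lengths of the fourteen bin lists, in product() order
bin_counts = [5, 4, 10, 4, 2, 5, 3, 5, 2, 4, 3, 4, 5, 2]
total = reduce(lambda a, b: a * b, bin_counts, 1)
print(total)  # 115200000
```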

A sample of the processed dataset can be found here: https://pastebin.com/tSH4UYXc

The filtered counts will later be used in a differential privacy algorithm.

Do you have any ideas on how to speed this up?

Do you think it would be faster if I made full use of NumPy?

Should I try numpy.vectorize?
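For example, I wonder whether assigning each row a bin index per column once (e.g. with np.digitize) and then counting group sizes in a single groupby would beat filtering the whole frame once per combination. A toy sketch with made-up values, using only two of my columns:

```python
import numpy as np
import pandas as pd

dataset = pd.DataFrame({"age": [20, 30, 40, 41],
                        "hours_per_week": [10, 35, 50, 55]})
# lower edges of the bins (taken from age_bins / hours_per_week_bins above)
age_edges = [15, 26, 36, 46, 56]
hours_edges = [0, 21, 41, 61]

binned = pd.DataFrame({
    "age_bin": np.digitize(dataset["age"], age_edges),
    "hours_bin": np.digitize(dataset["hours_per_week"], hours_edges),
})
# one pass yields the count of every occupied bin combination
counts = binned.groupby(["age_bin", "hours_bin"]).size()
```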

Please explain the solution, since I am new to Pandas and would like to learn how to write more efficient code.

Cheers,

0 Answers:

No answers yet.