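How can I categorize all columns of a dataset at once? (Turn every value into high, medium, or low)

Time: 2019-01-25 17:05:23

Tags: python pandas dataframe categorical-data

I am trying to convert all the values in my dataset to categorical values. I want every numeric value classified as low, average, or high according to the quantiles of its column.

So if a value is below the 25% quantile of its series, it is converted to "Low".

I tried using assign, and then apply with a function I wrote: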

def turn_into_categorical(row):
    quantile_level = [.25, .5, .75]
    for r in row:
        cut = refugees_T_F_V_P_full_data.r.quantile(quantile_level)
        if r >= cut[.75]:
            return "High"
        elif r >= cut[.25] and r < cut[0.75]:
            return "Average"
        else:
            return "Low"

refugees_T_F_V_P_full_data.apply(turn_into_categorical, axis = 1)

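However, the code does not work properly. I also tried doing it by iteration, but I would like to know whether there is a faster way.

This is the data I want to convert. Every number except Year and Month should be classified as low, medium, or high according to the quantiles.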

    Year  Month  Central Equatoria  Eastern Equatoria  Gogrial  Jonglei
0   2014     10                6.0                1.0      0.0      3.0   
1   2014     11                4.0                3.0      0.0     12.0   
2   2014     12                3.0                5.0      0.0     11.0   
3   2015      1                7.0                2.0      0.0      4.0   
4   2015      2                5.0                5.0      0.0     10.0   
5   2015      3                7.0                5.0      0.0      8.0   
6   2015      4                4.0                1.0      0.0      6.0   
7   2015      5                5.0                0.0      0.0      7.0   
8   2015      6                4.0                1.0      0.0      6.0   
9   2015      7               15.0                2.0      0.0      9.0   
10  2015      8               10.0                7.0      0.0      9.0   
11  2015      9               12.0                0.0      0.0      8.0   
12  2015     10               12.0                0.0      0.0      5.0   
13  2015     11                8.0                5.0      0.0     10.0   
14  2015     12                5.0                7.0      0.0      3.0 

Expected result (example):

    Year  Month  Central Equatoria  Eastern Equatoria  Gogrial  Jonglei
0   2014     10                High             Medium      Low      Medium  
1   2014     11                Low              Medium      Low     high

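4 answers:

Answer 0: (score: 2)

It looks like you want pd.qcut, which does exactly this. From the docs:

Quantile-based discretization function.

So you can apply pd.qcut along the columns of the dataframe, starting from Central Equatoria, and pass q = [0, 0.25, 0.75, 1.0] as the quantiles used to bin each series: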

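import pandas as pd
# A constant column (e.g. Gogrial) has no distinct bin edges, so label it 'low' directly
df.loc[:, 'Central Equatoria':].apply(
    lambda x: pd.qcut(x, q=[0, 0.25, 0.75, 1.0], labels=['low', 'medium', 'high'])
    if x.nunique() != 1 else 'low')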

Output:

       Central Equatoria Eastern Equatoria Gogrial Jonglei
0            medium              low     low     low
1               low           medium     low    high
2               low           medium     low    high
3            medium           medium     low     low
4            medium           medium     low    high
5            medium           medium     low  medium
6               low              low     low  medium
7            medium              low     low  medium
8               low              low     low  medium
9              high           medium     low  medium
10             high             high     low  medium
11             high              low     low  medium
12             high              low     low     low
13           medium           medium     low    high
14           medium             high     low     low
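
To see where those labels come from, here is a minimal sketch on two of the columns from the question. It assumes the default behaviour of pd.qcut (duplicates='raise') and uses retbins=True to expose the bin edges; the variable names are only for illustration.

```python
import pandas as pd

# Two columns copied from the question's data
jonglei = pd.Series([3, 12, 11, 4, 10, 8, 6, 7, 6, 9, 9, 8, 5, 10, 3],
                    dtype=float, name="Jonglei")
gogrial = pd.Series([0.0] * 15, name="Gogrial")

q = [0, 0.25, 0.75, 1.0]
labels = ["low", "medium", "high"]

# retbins=True also returns the bin edges that qcut derived from the data
binned, edges = pd.qcut(jonglei, q=q, labels=labels, retbins=True)
print(list(edges))
print(binned.head())

# Every quantile of a constant column is the same number, so the bin edges
# are not unique and qcut raises instead of binning
try:
    pd.qcut(gogrial, q=q, labels=labels)
    constant_column_failed = False
except ValueError as exc:
    constant_column_failed = True
    print("qcut failed on the constant column:", exc)
```

This is why the lambda above checks x.nunique() before calling pd.qcut.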

Answer 1: (score: 1)

One idea is to combine pd.DataFrame.quantile with pd.cut:

cats = ['Low', 'Medium', 'High']
quantiles = df.iloc[:, 2:].quantile([0, 0.25, 0.75, 1.0])

for col in df.iloc[:, 2:]:
    bin_edges = quantiles[col]
    # special case situations where all values are equal
    if bin_edges.nunique() == 1:
        df[col] = 'Low'
    else:
        df[col] = pd.cut(df[col], bins=bin_edges, labels=cats, include_lowest=True)

Result:

print(df)

    Year  Month CentralEquatoria EasternEquatoria Gogrial Jonglei
0   2014     10           Medium              Low     Low     Low
1   2014     11              Low           Medium     Low    High
2   2014     12              Low           Medium     Low    High
3   2015      1           Medium           Medium     Low     Low
4   2015      2           Medium           Medium     Low    High
5   2015      3           Medium           Medium     Low  Medium
6   2015      4              Low              Low     Low  Medium
7   2015      5           Medium              Low     Low  Medium
8   2015      6              Low              Low     Low  Medium
9   2015      7             High           Medium     Low  Medium
10  2015      8             High             High     Low  Medium
11  2015      9             High              Low     Low  Medium
12  2015     10             High              Low     Low     Low
13  2015     11           Medium           Medium     Low    High
14  2015     12           Medium             High     Low     Low
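
If you would rather not overwrite df or rely on column positions, the same idea can be wrapped in a small helper. This is only a sketch: the function name categorize_by_quantile and its skip and labels parameters are made up for illustration, and it still assumes that a column is either constant or has four distinct bin edges (pd.cut rejects partially duplicated edges).

```python
import pandas as pd

def categorize_by_quantile(df, skip=("Year", "Month"),
                           labels=("Low", "Medium", "High")):
    """Return a copy of df with every non-skipped column binned by its own quartiles."""
    out = df.copy()
    cols = [c for c in df.columns if c not in skip]
    # One row per quantile level, one column per data column
    edges = df[cols].quantile([0, 0.25, 0.75, 1.0])
    for col in cols:
        if edges[col].nunique() == 1:
            # Constant column: there is nothing to bin, use the lowest label
            out[col] = labels[0]
        else:
            out[col] = pd.cut(df[col], bins=edges[col].to_numpy(),
                              labels=list(labels), include_lowest=True)
    return out

# Data copied from the question
df = pd.DataFrame({
    "Year": [2014] * 3 + [2015] * 12,
    "Month": [10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "Central Equatoria": [6.0, 4, 3, 7, 5, 7, 4, 5, 4, 15, 10, 12, 12, 8, 5],
    "Eastern Equatoria": [1.0, 3, 5, 2, 5, 5, 1, 0, 1, 2, 7, 0, 0, 5, 7],
    "Gogrial": [0.0] * 15,
    "Jonglei": [3.0, 12, 11, 4, 10, 8, 6, 7, 6, 9, 9, 8, 5, 10, 3],
})

result = categorize_by_quantile(df)
print(result.head())
```

The original df keeps its numeric columns, so the helper can be rerun with different labels.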

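Answer 2: (score: 0)

Use pd.cut() with df.apply(). Note that with axis=1 each row is split into three equal-width bins, rather than each column being split by its quantiles: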

df.iloc[:,2:]=df.iloc[:,2:].apply(lambda x:pd.cut(x, 3, labels=['Low','Med','High']), axis=1)

    Year  Month Central_Equatoria Eastern_Equatoria Gogrial Jonglei
0   2014     10              High               Low     Low     Med
1   2014     11               Low               Low     Low    High
2   2014     12               Low               Med     Low    High
3   2015      1              High               Low     Low     Med
4   2015      2               Med               Med     Low    High
5   2015      3              High               Med     Low    High
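
Equal-width bins and quantile bins can give quite different labels. The sketch below compares the two on the Central Equatoria column from the question, applied per column rather than per row; the variable names are only for illustration.

```python
import pandas as pd

# Central Equatoria column copied from the question
central = pd.Series([6, 4, 3, 7, 5, 7, 4, 5, 4, 15, 10, 12, 12, 8, 5], dtype=float)
labels = ["Low", "Med", "High"]

# pd.cut with an integer splits the value range (3 to 15) into 3 equal-width bins
equal_width = pd.cut(central, 3, labels=labels)

# pd.qcut places the edges at the requested quantiles of the data instead
by_quantile = pd.qcut(central, q=[0, 0.25, 0.75, 1.0], labels=labels)

comparison = pd.DataFrame({"value": central,
                           "equal_width": equal_width,
                           "by_quantile": by_quantile})
print(comparison)
print(equal_width.value_counts().sort_index())
print(by_quantile.value_counts().sort_index())
```

With equal-width bins most values land in "Low" because a few large months stretch the range, while the quantile bins keep roughly a quarter of the values in each of the outer groups.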

Answer 3: (score: 0)

In the end I used the old-fashioned way:

new_df = pd.DataFrame()
name_list = list(df)

for name in name_list:
    if name != 'Year' and name != 'Month':
        new_row = []
        quantiles = df[name].quantile([.25, .5, .75])
        row_list = df[name].tolist()
        for i, value in enumerate(row_list):
            if value < quantiles[.25]:
                new_row.append("Low")
            elif value < quantiles[.75] and value >= quantiles[.25]:
                new_row.append("Average")
            else:
                new_row.append("High")
        series = pd.Series(new_row)
        new_df[name] = series.values

new_df.head()
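
The same comparisons can also be written without the Python loops. This is a sketch using np.select, which is not part of the answer above; it keeps the loop's exact boundaries (below the 25% quantile is "Low", from the 25% up to but not including the 75% quantile is "Average", everything else is "High").

```python
import numpy as np
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "Year": [2014] * 3 + [2015] * 12,
    "Month": [10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "Central Equatoria": [6.0, 4, 3, 7, 5, 7, 4, 5, 4, 15, 10, 12, 12, 8, 5],
    "Eastern Equatoria": [1.0, 3, 5, 2, 5, 5, 1, 0, 1, 2, 7, 0, 0, 5, 7],
    "Gogrial": [0.0] * 15,
    "Jonglei": [3.0, 12, 11, 4, 10, 8, 6, 7, 6, 9, 9, 8, 5, 10, 3],
})

values = df[[c for c in df.columns if c not in ("Year", "Month")]]
q25 = values.quantile(0.25)  # one threshold per column
q75 = values.quantile(0.75)

# Conditions are checked in order, so the second one only applies to
# values that are already known to be >= q25
labels = np.select(
    [(values < q25).to_numpy(), (values < q75).to_numpy()],
    ["Low", "Average"],
    default="High",
)
new_df = pd.DataFrame(labels, index=df.index, columns=values.columns)
print(new_df.head())
```

As in the loop version, a constant column such as Gogrial ends up entirely "High", because no value is strictly below its own quantiles.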