Classifying continuous data

Date: 2016-10-14 15:47:11

Tags: python scikit-learn percentile

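I have a Pandas DataFrame that I use for machine learning with scikit-learn in Python. One of the columns is the target value, which is continuous data (ranging from -10 to +10).

From the target column, I want to compute a new column containing 5 classes, where each class has the same number of rows. For example, if I have 1000 rows, I want them assigned to 5 classes of roughly 200 rows each.

So far I have done this in Excel, separately from my Python code, but as the data grows this has become impractical.

In Excel, I computed the percentiles and then used some logic to build the classes.

How can I do this in Python?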

2 answers:

Answer 0: (score: 0)

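# Create example data: 50 values uniformly distributed in [-10, 10)
import numpy as np
import pandas as pd
df = pd.DataFrame(20*np.random.rand(50, 1)-10, columns=['target'])

# Find the quantiles that separate the five groups
quantiles = df['target'].quantile([.2, .4, .6, .8])
# Label the groups, starting from the top group and working down.
# .loc is used instead of chained indexing (df['group'][mask] = ...),
# which raises SettingWithCopyWarning and may not modify df at all.
df['group'] = 5
df.loc[df['target'] < quantiles[.8], 'group'] = 4
df.loc[df['target'] < quantiles[.6], 'group'] = 3
df.loc[df['target'] < quantiles[.4], 'group'] = 2
df.loc[df['target'] < quantiles[.2], 'group'] = 1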
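
The same quantile-threshold labeling can also be written as a single vectorized step. The following is only a minimal sketch: the fixed seed, the 1000-row sample, and the use of `np.searchsorted` are choices made here for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Reproducible example data (seed and size are assumptions for illustration)
rng = np.random.default_rng(0)
df = pd.DataFrame({'target': rng.uniform(-10, 10, size=1000)})

# The four inner quantiles act as the edges between the five groups
edges = df['target'].quantile([.2, .4, .6, .8]).to_numpy()

# With side='right', searchsorted counts how many edges are <= each value (0..4).
# Adding 1 gives groups 1..5, consistent with the `<` comparisons used above.
df['group'] = np.searchsorted(edges, df['target'].to_numpy(), side='right') + 1

# Each group should hold roughly a fifth of the rows
print(df['group'].value_counts().sort_index())
```

Compared with the step-by-step assignments, this avoids overwriting the same rows several times, at the cost of being a little less explicit about which threshold produces which label.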

Answer 1: (score: 0)

While looking for an answer to a similar question, I found this post and the following pointer: What is the difference between pandas.qcut and pandas.cut?

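import numpy as np
import pandas as pd

# Generate 1000 values from a uniform distribution between -10 and 10
rows = np.random.uniform(-10, 10, size=1000)

# Discretize into 5 quantile-based classes of (roughly) equal size
rows_cut = pd.qcut(rows, 5)
# Integer class codes 0..4, ordered from the lowest bin to the highest.
# (factorize() numbers the bins in order of first appearance instead.)
classes = rows_cut.codes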
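
To make the difference between `pd.qcut` and `pd.cut` concrete, here is a minimal sketch on skewed data. The exponential sample, the seed, and the variable names are assumptions chosen for illustration; they do not come from the linked question.

```python
import numpy as np
import pandas as pd

# Skewed example data (distribution and seed are assumptions for illustration)
rng = np.random.default_rng(0)
values = pd.Series(rng.exponential(scale=2.0, size=1000))

# qcut: bin edges are sample quantiles, so each bin holds about the same number of rows
by_quantile = pd.qcut(values, 5, labels=False)

# cut: bin edges are equally spaced over the value range, so bin counts can differ widely
by_width = pd.cut(values, 5, labels=False)

print(by_quantile.value_counts().sort_index())
print(by_width.value_counts().sort_index())
```

For the equal-sized classes asked for in the question, `pd.qcut` is the one that fits. `pd.cut` would suit a case where the class boundaries themselves need to be evenly spaced.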