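I have a pandas DataFrame that I use for machine learning with scikit-learn in Python. One of the columns is the target value, which is continuous (ranging from -10 to +10).
From the target column, I want to compute a new column with 5 classes, where each class contains the same number of rows. For example, if I have 1000 rows, I want them split into 5 classes of roughly 200 rows each.
So far I have done this in Excel, separately from my Python code, but as the data grows this has become impractical.
In Excel, I computed the percentiles and then used some logic to build the classes.
How can I do this in Python?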
Answer 0: (score: 0)
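# Create sample data: 50 values uniformly distributed in [-10, 10)
import numpy as np
import pandas as pd
df = pd.DataFrame(20*np.random.rand(50, 1)-10, columns=['target'])
# Find the quantiles that separate the five groups
quantiles = df['target'].quantile([.2, .4, .6, .8])
# Label the groups, starting from the top group and working down.
# .loc is used instead of chained indexing (df['group'][mask] = ...),
# which can silently fail to modify df.
df['group'] = 5
df.loc[df['target'] < quantiles[.8], 'group'] = 4
df.loc[df['target'] < quantiles[.6], 'group'] = 3
df.loc[df['target'] < quantiles[.4], 'group'] = 2
df.loc[df['target'] < quantiles[.2], 'group'] = 1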
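The same quantile-threshold labeling can also be written as a single vectorized step. The following is only a sketch under stated assumptions: the seeded generator, the 1000-row sample, and the use of `np.searchsorted` are illustrative choices, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the fixed seed is only for reproducibility.
rng = np.random.default_rng(0)
df = pd.DataFrame({"target": rng.uniform(-10, 10, size=1000)})

# Four interior cut points split the data into five equal-frequency groups.
edges = df["target"].quantile([0.2, 0.4, 0.6, 0.8]).to_numpy()

# With side="right", searchsorted counts how many cut points are <= each value,
# giving 0..4; adding 1 yields the labels 1..5 used above.
df["group"] = np.searchsorted(edges, df["target"].to_numpy(), side="right") + 1

# Check how many rows landed in each group.
counts = df["group"].value_counts().sort_index()
print(counts)
```

With continuous data and no repeated values, each group should hold roughly one fifth of the rows.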
Answer 1: (score: 0)
While looking for an answer to a similar problem, I found this post and the following hint: What is the difference between pandas.qcut and pandas.cut?
import numpy as np
import pandas as pd
#generate 1000 rows of uniform distribution between -10 and 10
rows = np.random.uniform(-10, 10, size = 1000)
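#discretize into 5 equal-frequency classes
rows_cut = pd.qcut(rows, 5)
#integer class codes 0..4, ordered from the lowest to the highest bin
#(factorize() would number the bins by order of first appearance instead)
classes = rows_cut.codes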
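To relate this back to the question, here is a minimal sketch that applies `pd.qcut` directly to a DataFrame column and contrasts it with `pd.cut`. The column names `class_qcut` and `class_cut`, the seed, and the bell-shaped sample data are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical, roughly bell-shaped target values; seed fixed for reproducibility.
rng = np.random.default_rng(42)
df = pd.DataFrame({"target": rng.normal(0, 3, size=1000)})

# qcut: equal-frequency bins. labels=False returns integer codes 0..4,
# ordered from the lowest to the highest bin.
df["class_qcut"] = pd.qcut(df["target"], q=5, labels=False)

# cut: equal-width bins over the observed range, so row counts can differ a lot.
df["class_cut"] = pd.cut(df["target"], bins=5, labels=False)

qcut_counts = df["class_qcut"].value_counts().sort_index()
cut_counts = df["class_cut"].value_counts().sort_index()
print(qcut_counts)
print(cut_counts)
```

For the equal-sized classes asked about in the question, `pd.qcut` is the one that fits. `pd.cut` is more appropriate when the bin boundaries themselves should be evenly spaced.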