我有一个包含连续值(从0到1020)的Python列表,我希望使用K-Means策略将这些值从0到5的序数取舍。
我已经使用新的类sklearn.preprocessing.KBinsDiscretizer
来执行该操作:
def descritise_kmeans(python_arr, num_bins):
X = np.array(python_arr).reshape(-1, 1)
est = KBinsDiscretizer(n_bins=num_bins, encode='ordinal', strategy='kmeans')
est.fit(X)
Xt = est.transform(X)
return Xt
运行此方法时,出现错误:
/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/preprocessing/_discretization.py in transform(self, X)
262 atol = 1.e-8
263 eps = atol + rtol * np.abs(Xt[:, jj])
--> 264 Xt[:, jj] = np.digitize(Xt[:, jj] + eps, bin_edges[jj][1:])
265 np.clip(Xt, 0, self.n_bins_ - 1, out=Xt)
266
ValueError: bins must be monotonically increasing or decreasing
仔细观察这一点,似乎numpy.descritize
方法是引发错误的方法。这似乎是Sklearn库的错误。
当容器n_bins
的数量为6时,将引发错误。但是,当n_bins
为5时,它将起作用。
答案 0 :(得分:0)
我遇到了类似的问题,但是在设置垃圾箱的值时发现了我的错误。我的代码很简单
bins = np.array([0.0, .33, 66, 1])
data = [0.1, .2, .4, .5, .7, 8]
inds = np.digitize(data, bins, right=False)
在.66之前我错过了一个点,我的垃圾箱不是单调的。尽管它可能不是这个问题的根源,但我希望它能对某人有所帮助。
答案 1 :(得分:0)
Makeshift解决方案:
使用以下转换功能编辑sklearns源代码:sklearn / preprocessing / _discretization.py
从版本“ 0.20.2”开始,它位于第237行
def transform(self, X):
"""Discretizes the data.
Parameters
----------
X : numeric array-like, shape (n_samples, n_features)
Data to be discretized.
Returns
-------
Xt : numeric array-like or sparse matrix
Data in the binned space.
"""
check_is_fitted(self, ["bin_edges_"])
Xt = check_array(X, copy=True, dtype=FLOAT_DTYPES)
n_features = self.n_bins_.shape[0]
if Xt.shape[1] != n_features:
raise ValueError("Incorrect number of features. Expecting {}, "
"received {}.".format(n_features, Xt.shape[1]))
def ensure_monotic_increase(array):
"""
add small noise to the bin_edges[i]
when bin_edges[i] !> bin_edges[i-1]
"""
noise_overlay = np.zeros(array.shape)
for i in range(1,len(array)):
bigger = array[i]>array[i-1]
if bigger:
pass
else:
noise_overlay[i] = abs(array[i-1] * 0.0001)
return(array+noise_overlay)
bin_edges = self.bin_edges_
for jj in range(Xt.shape[1]):
# Values which are close to a bin edge are susceptible to numeric
# instability. Add eps to X so these values are binned correctly
# with respect to their decimal truncation. See documentation of
# numpy.isclose for an explanation of ``rtol`` and ``atol``.
rtol = 1.e-5
atol = 1.e-8
eps = atol + rtol * np.abs(Xt[:, jj])
old_bin_edges = bin_edges[jj][1:]
try:
Xt[:, jj] = np.digitize(Xt[:, jj] + eps, old_bin_edges)
except ValueError:
new_bin_edges = ensure_monotic_increase(old_bin_edges)
#print(old_bin_edges)
#print(new_bin_edges)
try:
Xt[:, jj] = np.digitize(Xt[:, jj] + eps, new_bin_edges)
except:
raise
np.clip(Xt, 0, self.n_bins_ - 1, out=Xt)
if self.encode == 'ordinal':
return Xt
return self._encoder.transform(Xt)
(我遇到的)问题
垃圾桶边缘彼此之间过于靠近。 可能,由于某种浮点错误,先前的bin边缘最终大于下一个bin边缘。
当打印边缘时(取消注释上述功能中的打印语句),前两个bin边缘可观察到彼此相等。印刷的bin_edges是:
[-0.1025641 -0.1025641 0.82793522] # ValueError
[-0.1025641 -0.10255385 0.82793522] # After fix
[0.2075 0.2075 0.88798077] # ValueError
[0.2075 0.20752075 0.88798077] # After fix
[ 0.7899066 0.7899066 24.31967669] # ValueError
[ 0.7899066 0.78998559 24.31967669] # After fix
[5.47545572e-18 5.47545572e-18 2.36842105e-01] # ValueError
[5.47545572e-18 5.47600326e-18 2.36842105e-01] # After fix
[5.47545572e-18 5.47545572e-18 2.82894737e-01] # ValueError
[5.47545572e-18 5.47600326e-18 2.82894737e-01] # After fix
[-0.46762302 -0.46762302 -0.00969465] # ValueError
[-0.46762302 -0.46757626 -0.00969465] # After fix