Computing a distance matrix.

Question

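I am using agglomerative clustering to cluster a vehicle dataset. I compute the distance matrix in two ways: once with scipy.spatial.distance.euclidean and once with scipy.spatial.distance_matrix. As I understand it, both approaches should give the same result. I think they do, but when I compare the two outputs element by element, some of the comparisons come out False. Can someone explain why this happens?

Steps to reproduce: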

!wget -O cars_clus.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/cars_clus.csv
filename = 'cars_clus.csv'

# Read the CSV
import pandas as pd
pdf = pd.read_csv(filename)

# Clean the data
pdf[[ 'sales', 'resale', 'type', 'price', 'engine_s',
       'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap',
       'mpg', 'lnsales']] = pdf[['sales', 'resale', 'type', 'price', 'engine_s',
       'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap',
       'mpg', 'lnsales']].apply(pd.to_numeric, errors='coerce')
pdf = pdf.dropna()
pdf = pdf.reset_index(drop=True)

# selecting the feature set
featureset = pdf[['engine_s',  'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap', 'mpg']]

# Normalised using minmax
from sklearn.preprocessing import MinMaxScaler
x = featureset.values #returns a numpy array
min_max_scaler = MinMaxScaler()
feature_mtx = min_max_scaler.fit_transform(x)

Computing the distance matrix:

#M1 : Using scipy's euclidean

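import numpy as np
from scipy.spatial.distance import euclidean

leng = feature_mtx.shape[0]
D = np.zeros([leng, leng])  # np.zeros replaces scipy.zeros, a deprecated alias of the NumPy function
for i in range(leng):
    for j in range(leng):
        # Euclidean distance between row i and row j
        D[i, j] = euclidean(feature_mtx[i], feature_mtx[j])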
print(pd.DataFrame(D).head())

# M2 : using scipy.spatial's distance_matrix

from scipy.spatial import distance_matrix
dist_matrix = distance_matrix(feature_mtx, feature_mtx)
print(pd.DataFrame(dist_matrix).head())

As you can see, the two results look identical, yet comparing the two matrices does not give True for every element:

# Comparing

pd.DataFrame(dist_matrix == D).head()

Any help would be appreciated.

Answer 1

Building on Graipher's answer, you can try the following:

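# Element-wise comparison within a tolerance instead of exact equality
comp = np.isclose(dist_matrix, D)
pd.DataFrame(comp).head()

As for why this happens: it is not a bug. It comes from the internal representation of floating-point numbers, which use a fixed number of binary digits to represent a decimal number. Some decimal numbers cannot be represented exactly in binary, which leads to small rounding errors. The two functions most likely reach the same mathematical value through different sequences of floating-point operations, so their outputs can differ in the last few bits, and an exact == comparison then returns False. People are often very surprised by results like this.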
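The rounding effect described above shows up even without SciPy. The following is a minimal sketch in plain Python; the values 0.1, 0.2 and 0.3 are chosen only for illustration and do not come from the dataset in the question.

```python
import math

# 0.1, 0.2 and 0.3 have no exact binary representation,
# so the sum carries a tiny rounding error.
a = 0.1 + 0.2

print(a == 0.3)              # False: exact comparison fails
print(math.isclose(a, 0.3))  # True: comparison within a relative tolerance
print(f"{a:.20f}")           # shows the digits beyond the usual display precision
```

The same reasoning applies to each element of the two distance matrices: the values agree to many digits, but not necessarily bit for bit.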

Floating-point numbers are stored in only 32 or 64 bits, so the digits of a value are cut off at some point.
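To see how small the discrepancy is, one option is to look at the largest absolute difference between the two matrices. The sketch below is self-contained and uses random data as a hypothetical stand-in for feature_mtx (the array X, the seed, and the names D_loop and D_vec are assumptions made for illustration, not part of the question).

```python
import numpy as np
from scipy.spatial import distance_matrix
from scipy.spatial.distance import euclidean

# Hypothetical stand-in for feature_mtx: 50 rows, 8 features in [0, 1)
rng = np.random.default_rng(0)
X = rng.random((50, 8))

# Method 1: pairwise loop with scipy.spatial.distance.euclidean
n = X.shape[0]
D_loop = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        D_loop[i, j] = euclidean(X[i], X[j])

# Method 2: vectorised scipy.spatial.distance_matrix
D_vec = distance_matrix(X, X)

max_diff = np.abs(D_loop - D_vec).max()
print("largest absolute difference:", max_diff)
print("exactly equal everywhere:", np.array_equal(D_loop, D_vec))
print("equal within tolerance:", np.allclose(D_loop, D_vec))
```

If the largest difference is on the order of machine precision, the two methods agree for all practical purposes, and np.allclose (or np.isclose for an element-wise mask) is the appropriate check.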

scipy.spatial.distance.euclidean and scipy.spatial.distance_matrix not returning the same result?
