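I created a "Euclidean similarity matrix" (which I converted to a DataFrame) using the following links: https://stats.stackexchange.com/questions/53068/euclidean-distance-score-and-similarity http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.euclidean.html
My approach is iterative, so it takes a while when the dataset is large. By comparison, pandas' pd.DataFrame.corr() is very fast and useful for Pearson correlation.
How can I compute a Euclidean similarity measure without exhaustive iteration?
My naive code is below: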
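# Euclidean similarity
import numpy as np
import pandas as pd
import scipy as sp
import scipy.spatial.distance  # makes sp.spatial.distance available
# Create DataFrame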
DF_var = pd.DataFrame.from_dict({"s1":[1.2,3.4,10.2],"s2":[1.4,3.1,10.7],"s3":[2.1,3.7,11.3],"s4":[1.5,3.2,10.9]}).T
DF_var.columns = ["g1","g2","g3"]
#      g1   g2    g3
# s1  1.2  3.4  10.2
# s2  1.4  3.1  10.7
# s3  2.1  3.7  11.3
# s4  1.5  3.2  10.9
#Create empty matrix to fill
M_euclid = np.zeros((DF_var.shape[1],DF_var.shape[1]))
#Iterate through DataFrame columns to measure euclidean distance
for i in range(DF_var.shape[1]):
    u = DF_var[DF_var.columns[i]]
    for j in range(DF_var.shape[1]):
        v = DF_var[DF_var.columns[j]]
        # Euclidean distance -> Euclidean similarity
        M_euclid[i,j] = (1/(1+sp.spatial.distance.euclidean(u,v)))
DF_euclid = pd.DataFrame(M_euclid,columns=DF_var.columns,index=DF_var.columns)
#           g1        g2        g3
# g1  1.000000  0.215963  0.051408
# g2  0.215963  1.000000  0.063021
# g3  0.051408  0.063021  1.000000
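Answer 0 (score: 8)
There are two useful functions in scipy.spatial.distance for this: pdist and squareform. pdist returns the pairwise distances between observations as a one-dimensional array, and squareform converts that array into a distance matrix.
One catch is that pdist uses distance measures by default, not similarities, so you need to specify your similarity function manually. Judging by the commented output in your code, your DataFrame is also not in the orientation pdist expects (it compares rows), so I have undone the transpose you did in your code.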
import numpy as np
import pandas as pd
from scipy.spatial.distance import euclidean, pdist, squareform
def similarity_func(u, v):
    # Map a Euclidean distance d to a similarity 1/(1+d)
    return 1/(1+euclidean(u,v))
DF_var = pd.DataFrame.from_dict({"s1":[1.2,3.4,10.2],"s2":[1.4,3.1,10.7],"s3":[2.1,3.7,11.3],"s4":[1.5,3.2,10.9]})
DF_var.index = ["g1","g2","g3"]
dists = pdist(DF_var, similarity_func)
sim = squareform(dists)
# squareform leaves the diagonal at 0; a row's similarity to itself is 1
np.fill_diagonal(sim, 1.0)
DF_euclid = pd.DataFrame(sim, columns=DF_var.index, index=DF_var.index)
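To make the two shapes concrete, here is a small sketch of what pdist returns and how squareform expands it. The array `X` and the variable names are illustrative only; the values are the g1..g3 rows from the question. Note that squareform always puts zeros on the diagonal, which is right for a distance matrix but not for a similarity where self-similarity should be 1.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Illustrative data: 3 observations (rows) with 4 features each,
# i.e. the g1..g3 rows of the untransposed DataFrame.
X = np.array([
    [1.2, 1.4, 2.1, 1.5],      # g1
    [3.4, 3.1, 3.7, 3.2],      # g2
    [10.2, 10.7, 11.3, 10.9],  # g3
])

condensed = pdist(X)            # default metric is Euclidean
square = squareform(condensed)  # symmetric matrix, zeros on the diagonal

print(condensed.shape)  # n*(n-1)/2 entries, one per unordered pair of rows
print(square.shape)     # n x n
```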
Answer 1 (score: 2)
I think you can apply pdist and squareform directly to your DataFrame:
from scipy.spatial.distance import pdist,squareform
In [6]: squareform(pdist(DF_var, metric='euclidean'))
Out[6]:
array([[ 0. , 0.6164414 , 1.4525839 , 0.78740079],
[ 0.6164414 , 0. , 1.1 , 0.24494897],
[ 1.4525839 , 1.1 , 0. , 0.87749644],
[ 0.78740079, 0.24494897, 0.87749644, 0. ]])
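The matrix above holds distances between the rows s1..s4, not similarities between the g columns. One possible way to get the similarity matrix the question asks for is to let pdist use its built-in Euclidean metric on the transposed data and then apply 1/(1+d) to the whole array at once. This is a sketch that assumes `DF_var` is laid out as in the question (samples as rows, g1..g3 as columns); the names `dist` and `DF_sim` are illustrative.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

DF_var = pd.DataFrame.from_dict({
    "s1": [1.2, 3.4, 10.2],
    "s2": [1.4, 3.1, 10.7],
    "s3": [2.1, 3.7, 11.3],
    "s4": [1.5, 3.2, 10.9],
}).T
DF_var.columns = ["g1", "g2", "g3"]

# pdist compares rows, so transpose to compare the g columns.
dist = squareform(pdist(DF_var.T.values, metric="euclidean"))

# Element-wise transform; the zero diagonal becomes 1/(1+0) = 1.
DF_sim = pd.DataFrame(1 / (1 + dist), index=DF_var.columns, columns=DF_var.columns)
print(DF_sim.round(6))
```

Because the built-in metric runs in compiled code and the transform is a single array operation, this avoids calling a Python function once per pair.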
Answer 2 (score: 1)
You need scipy.spatial.distance.pdist or sklearn.metrics.pairwise.pairwise_distances.
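As a rough illustration of the scikit-learn route, the sketch below assumes scikit-learn is installed and uses an illustrative array `X` holding the g1..g3 rows from the question. pairwise_distances returns the full square matrix directly, so no squareform step is needed.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Illustrative data: one row per item to compare (g1..g3 from the question).
X = np.array([
    [1.2, 1.4, 2.1, 1.5],      # g1
    [3.4, 3.1, 3.7, 3.2],      # g2
    [10.2, 10.7, 11.3, 10.9],  # g3
])

dist = pairwise_distances(X, metric="euclidean")  # full (3, 3) distance matrix
sim = 1 / (1 + dist)                              # distance -> similarity
print(np.round(sim, 6))
```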
Answer 3 (score: 0)
The simplest way I could find to get the same result as the OP is to use distance_matrix, also from scipy.spatial. The whole thing can be done in one long line.
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix
# Original code from OP, slightly reformatted
DF_var = pd.DataFrame.from_dict({
    "s1":[1.2,3.4,10.2],
    "s2":[1.4,3.1,10.7],
    "s3":[2.1,3.7,11.3],
    "s4":[1.5,3.2,10.9]
}).T
DF_var.columns = ["g1","g2","g3"]
# Whole similarity algorithm in one line
df_euclid = pd.DataFrame(
    1 / (1 + distance_matrix(DF_var.T, DF_var.T)),
    columns=DF_var.columns, index=DF_var.columns
)
#           g1        g2        g3
# g1  1.000000  0.215963  0.051408
# g2  0.215963  1.000000  0.063021
# g3  0.051408  0.063021  1.000000
The code above should run as-is when copied and pasted into any Python IDE.
Answer 4 (score: 0)
This is what I did:
import pandas as pd
from scipy.spatial.distance import euclidean
DF_var = pd.DataFrame.from_dict({"s1":[1.2,3.4,10.2],"s2":[1.4,3.1,10.7],"s3":[2.1,3.7,11.3],"s4":[1.5,3.2,10.9]}).T
DF_var.columns = ["g1","g2","g3"]
def m_euclid(v1, v2):
    # Euclidean distance -> similarity in (0, 1]
    return (1/(1 + euclidean(v1,v2)))
dist_list = []
# One row of similarities per column of DF_var
for j1 in DF_var.columns:
    dist_list.append([m_euclid(DF_var[j1], DF_var[j2]) for j2 in DF_var.columns])
dist_matrix = pd.DataFrame(dist_list)