python / pandas / sklearn: getting the closest matches from pairwise_distances

Time: 2016-10-19 18:41:36

Tags: python pandas scikit-learn

I have a dataframe, and I am trying to get the closest matches using the Mahalanobis distance across three categories, like this:


from io import StringIO
from sklearn import metrics
import pandas as pd

stringdata = StringIO(u"""pid,ratio1,pct1,rsp
0,2.9,26.7,95.073615
1,11.6,29.6,96.963660
2,0.7,37.9,97.750412
3,2.7,27.9,102.750412
4,1.2,19.9,93.750412
5,0.2,22.1,96.750412
""")

stats = ['ratio1','pct1','rsp']
df = pd.read_csv(stringdata)

# .values is used here because .as_matrix() was removed in pandas 1.0
d = metrics.pairwise.pairwise_distances(df[stats].values,
    metric='mahalanobis')

print(df)
print(d)

The pid column is a unique identifier.

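What I need to do is take the ndarray returned by the pairwise_distances call and update the original dataframe so that each row has some kind of list of its N closest matches (so pid 0 might have an ordered list, by distance, like 2, 1, 5, 3, 4, or whatever it actually is), but I have no idea how to do this in Python.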

1 answer:

Answer 0 (score: 1)

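from io import StringIO

import numpy as np
import pandas as pd
from sklearn import metrics

stringdata = StringIO(u"""pid,ratio1,pct1,rsp
0,2.9,26.7,95.073615
1,11.6,29.6,96.963660
2,0.7,37.9,97.750412
3,2.7,27.9,102.750412
4,1.2,19.9,93.750412
5,0.2,22.1,96.750412
""")

stats = ['ratio1','pct1','rsp']
df = pd.read_csv(stringdata)

# .values is used here because .as_matrix() was removed in pandas 1.0
dist = metrics.pairwise.pairwise_distances(df[stats].values,
    metric='mahalanobis')

# Sort each row of the distance matrix: the result holds row positions
# ordered from nearest to farthest. The first entry of every row is the
# row itself (distance 0). Positions equal pid here only because pid is 0..5.
ranks = np.argsort(dist, axis=1)

# Store each ranking as a comma-separated string, one per row
df["rankcol"] = [','.join(map(str, row)) for row in ranks]
df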
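
The question asks for the N closest matches per row, while the ranking above keeps every row, including the row itself in first place. Below is a minimal sketch of one way to trim it. The helper name nearest_n and the toy distance matrix are made up for illustration; they are not part of the original answer or of any library. It masks the diagonal so a row cannot match itself, keeps the first n positions, and maps them back to identifier values rather than row positions.

```python
import numpy as np
import pandas as pd


def nearest_n(dist, ids, n):
    """Return, for each row, the ids of its n nearest *other* rows.

    Hypothetical helper: `dist` is a square distance matrix and `ids`
    holds one identifier per row, in the same order.
    """
    dist = np.asarray(dist, dtype=float).copy()
    # Every row is its own nearest match (distance 0), so mask the diagonal.
    np.fill_diagonal(dist, np.inf)
    # Stable sort, so ties are resolved by row position.
    order = np.argsort(dist, axis=1, kind="stable")[:, :n]
    ids = np.asarray(ids)
    return [ids[row].tolist() for row in order]


# Toy symmetric distance matrix with made-up values
# (not the Mahalanobis output from the answer).
toy_dist = np.array([
    [0.0, 2.0, 1.0, 3.0],
    [2.0, 0.0, 2.5, 0.5],
    [1.0, 2.5, 0.0, 1.5],
    [3.0, 0.5, 1.5, 0.0],
])
toy = pd.DataFrame({"pid": [10, 11, 12, 13]})
toy["nearest"] = nearest_n(toy_dist, toy["pid"], n=2)
print(toy)
```

With the variables from the answer, the equivalent call would presumably be df["nearest"] = nearest_n(dist, df["pid"], n=3). Returning identifiers instead of positions matters as soon as pid is not simply 0, 1, 2, ... in row order.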