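I have a pandas DataFrame like the one constructed below.
It is a term-by-term similarity matrix. For an n x n term-by-term similarity matrix, I want to keep k entries per term, namely the ones for the terms most similar to it, and replace the remaining n-k similarity entries with 0. For this toy example with k = 2, each row should keep only its two largest values and have 0 everywhere else.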
import numpy as np
import pandas as pd
frame=pd.DataFrame(data=np.array([[1,0.5,0.2,0.3],[0.5,1,0.3,0.4],[0.2,0.3,1,0.7],[0.3,0.4,0.7,1]]),columns=['w1','w2','w3','w4'])
frame.index=['w1','w2','w3','w4']
Could you tell me how to code this in pandas in a way that still works when applied to a large matrix?
Answer 0 (score: 1)
One way:
frame.where(frame.isin(frame.stack().sort_values(ascending=False).unique()[:k+1]), 0.0)
Out[88]:
w1 w2 w3 w4
w1 1.0 0.5 0.0 0.0
w2 0.5 1.0 0.0 0.0
w3 0.0 0.0 1.0 0.7
w4 0.0 0.0 0.7 1.0
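Explanation: frame.stack() flattens the matrix into a single Series, sort_values(ascending=False).unique()[:k+1] picks the k+1 largest distinct values in the whole frame, and where(..., 0.0) keeps the entries equal to one of those values and sets everything else to 0.0. Note that this threshold is global rather than per row, so it reproduces the per-row result only when every row's top k values happen to be among those k+1 values, as they are in this example.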
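If the k largest values should be selected separately within each row, a per-row variant can be sketched with DataFrame.rank. This is only an illustration, not the answerer's code: it assumes the sample frame with 0.2 in the w1/w3 cells, assumes that k counts the diagonal self-similarity, and breaks ties by column order via method='first'.

```python
import pandas as pd

labels = ['w1', 'w2', 'w3', 'w4']
frame = pd.DataFrame(
    [[1.0, 0.5, 0.2, 0.3],
     [0.5, 1.0, 0.3, 0.4],
     [0.2, 0.3, 1.0, 0.7],
     [0.3, 0.4, 0.7, 1.0]],
    index=labels,
    columns=labels,
)
k = 2  # entries to keep per row (assumed to include the diagonal)

# Rank the values inside each row; the largest value gets rank 1.
# method='first' resolves ties by column order, so exactly k entries survive.
ranks = frame.rank(axis=1, ascending=False, method='first')

# Keep the entries ranked within the top k, replace all others with 0.0.
result = frame.where(ranks <= k, 0.0)
print(result)
```

Ranking sorts every row, so on very large matrices the np.argpartition approach in the other answer is likely to be faster.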
Answer 1 (score: 1)
Here's one approach that uses NumPy's advanced indexing and np.argpartition to select and reset the appropriate number of elements per row:
k = 2 # no. of records to keep
a = frame.values # Extract the values as an array view
n = a.shape[1] - k # no. of elements to be reset per row
idx = np.argpartition(a,n,axis=1)[:,:n] # smallest n column indices per row
a[np.arange(idx.shape[0])[:,None], idx] = 0 # reset those in array/dataframe
Sample run:
In [478]: frame=pd.DataFrame(data=np.array([[1,0.5,0.2,0.3],[0.5,1,0.3,0.4],\
...: [0.2,0.3,1,0.7],[0.3,0.4,0.7,1]]),columns=['w1','w2','w3','w4'])
...: frame.index=['w1','w2','w3','w4']
...:
In [479]: frame
Out[479]:
w1 w2 w3 w4
w1 1.0 0.5 0.2 0.3
w2 0.5 1.0 0.3 0.4
w3 0.2 0.3 1.0 0.7
w4 0.3 0.4 0.7 1.0
## After code run with k=2
In [481]: frame
Out[481]:
w1 w2 w3 w4
w1 1.0 0.5 0.0 0.0
w2 0.5 1.0 0.0 0.0
w3 0.0 0.0 1.0 0.7
w4 0.0 0.0 0.7 1.0
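The code above modifies frame in place, which relies on frame.values returning a writable view of the underlying data. That is not guaranteed in every pandas version or for every frame (for example mixed dtypes, or copy-on-write mode). A minimal sketch of the same argpartition idea that returns a new DataFrame instead is shown below; the function name keep_top_k_per_row is hypothetical and introduced only for illustration.

```python
import numpy as np
import pandas as pd

def keep_top_k_per_row(df, k):
    """Return a new frame in which all but the k largest values per row are 0."""
    if k < 1:
        raise ValueError("k must be at least 1")
    a = df.to_numpy(dtype=float, copy=True)  # explicit copy, df is left untouched
    n = a.shape[1] - k                       # number of entries to zero per row
    if n > 0:
        # Column indices of the n smallest values in each row (unordered).
        idx = np.argpartition(a, n, axis=1)[:, :n]
        # Pair every row index with its n column indices and zero those cells.
        a[np.arange(a.shape[0])[:, None], idx] = 0.0
    return pd.DataFrame(a, index=df.index, columns=df.columns)

labels = ['w1', 'w2', 'w3', 'w4']
frame = pd.DataFrame(
    [[1.0, 0.5, 0.2, 0.3],
     [0.5, 1.0, 0.3, 0.4],
     [0.2, 0.3, 1.0, 0.7],
     [0.3, 0.4, 0.7, 1.0]],
    index=labels,
    columns=labels,
)
top2 = keep_top_k_per_row(frame, 2)
print(top2)
```

np.argpartition only partially sorts each row, so the cost per row grows roughly linearly with n instead of as n log n, which is what makes this approach attractive for large matrices.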