I want to run a function on the columns of a pandas DataFrame. The corpus is a pd.DataFrame:
import pandas as pd
import numpy as np
from scipy.spatial.distance import cosine
corpus = pd.DataFrame([[3,1,1,1,1,60],[2,2,0,2,0,20], [0,2,1,1,0,0], [0,0,2,1,0,1],[0,0,0,0,1,0]],index=["stark","groß","schwach","klein", "dick"],columns=["d1", "d2", "d3","d4","d5","d6"])
I also have a query. The query is a pandas Series:
query = pd.Series([1,1,0,0,0], index=["stark","groß","schwach","klein", "dick"])
Now I want to run the cosine function on each column of the corpus against the query:
for column in corpus:
    print("Similarity of Documents", column, " and query: \n", 1 - cosine(query, corpus[column]))
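Is there a better way to run the cosine function over the columns? Perhaps some method that takes the columns and runs a function on each one. I would like to avoid the for loop.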
Answer 0 (score: 2)
You can use scipy.spatial.distance.cdist with the 'cosine' metric for a vectorized solution, like so -
from scipy.spatial.distance import cdist
out = 1-cdist(query.values[None], corpus.values.T, 'cosine')
Sample run -
In [192]: corpus
Out[192]:
         d1  d2  d3  d4  d5  d6
stark     3   1   1   1   1  60
groß      2   2   0   2   0  20
schwach   0   2   1   1   0   0
klein     0   0   2   1   0   1
dick      0   0   0   0   1   0
In [193]: query
Out[193]:
stark      1
groß       1
schwach    0
klein      0
dick       0
dtype: int64
In [194]: from scipy.spatial.distance import cosine
In [195]: for column in corpus:
     ...:     print(1-cosine(query, corpus[column]))
     ...:
0.980580675691
0.707106781187
0.288675134595
0.801783725737
0.5
0.89431540856
In [196]: 1-cdist(query.values[None], corpus.values.T, 'cosine')
Out[196]: array([[ 0.98058, 0.70711, 0.28868, 0.80178, 0.5 , 0.89432]])
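The cdist call returns a 2-D array of shape (1, number of columns), so the document labels are lost. A minimal sketch of how the result could be turned back into a labelled Series; the names words, sim and similarity are introduced here for illustration and are not part of the answer:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

words = ["stark", "groß", "schwach", "klein", "dick"]
corpus = pd.DataFrame(
    [[3, 1, 1, 1, 1, 60],
     [2, 2, 0, 2, 0, 20],
     [0, 2, 1, 1, 0, 0],
     [0, 0, 2, 1, 0, 1],
     [0, 0, 0, 0, 1, 0]],
    index=words,
    columns=["d1", "d2", "d3", "d4", "d5", "d6"],
)
query = pd.Series([1, 1, 0, 0, 0], index=words)

# cdist expects two 2-D arrays with one vector per row:
# query.values[None] has shape (1, 5), corpus.values.T has shape (6, 5).
sim = 1 - cdist(query.values[None], corpus.values.T, "cosine")

# sim has shape (1, 6); take its only row and reattach the column labels.
similarity = pd.Series(sim[0], index=corpus.columns)
print(similarity)
```

Note that cdist works on the raw values, so this assumes query and corpus list their rows in the same order; unlike pandas arithmetic, nothing is aligned by index label.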
Runtime test -
In [225]: corpus = pd.DataFrame(np.random.rand(100,10000))
In [226]: query = pd.Series(np.random.rand(100))
# @C.Square's apply based soln
In [227]: %timeit corpus.apply(lambda x:1-cosine(query, x), axis=0)
1 loop, best of 3: 352 ms per loop
# Proposed in this post using cdist()
In [228]: %timeit 1-cdist(query.values[None], corpus.values.T, 'cosine')
100 loops, best of 3: 3.2 ms per loop
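For readers without IPython, the same comparison can be sketched as a plain script with the standard timeit module. The fixed seed and the helper names with_apply and with_cdist are assumptions added here; absolute timings will differ by machine and library version:

```python
import timeit

import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist, cosine

rng = np.random.default_rng(0)  # fixed seed (an assumption) for repeatability
corpus = pd.DataFrame(rng.random((100, 10000)))
query = pd.Series(rng.random(100))


def with_apply():
    # One Python-level call to cosine() per column.
    return corpus.apply(lambda x: 1 - cosine(query, x), axis=0)


def with_cdist():
    # A single vectorized call covering all columns.
    return 1 - cdist(query.values[None], corpus.values.T, "cosine")


# Check that both approaches agree before comparing their speed.
assert np.allclose(with_apply().values, with_cdist()[0])

for fn in (with_apply, with_cdist):
    seconds = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{fn.__name__}: {seconds * 1e3:.1f} ms")
```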
Answer 1 (score: 1)
You can also use the definition of cosine similarity and implement it yourself.
pandas
corpus.T.dot(query) / (corpus ** 2).sum() ** .5 / (query ** 2).sum() ** .5
d1    0.980581
d2    0.707107
d3    0.288675
d4    0.801784
d5    0.500000
d6    0.894315
dtype: float64
numpy
c = corpus.values
q = query.values
r = c.T.dot(q) / (c ** 2).sum(0) ** .5 / (q ** 2).sum() ** .5
pd.Series(r, corpus.columns)
d1    0.980581
d2    0.707107
d3    0.288675
d4    0.801784
d5    0.500000
d6    0.894315
dtype: float64
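Both the pandas and numpy snippets evaluate the usual definition of cosine similarity, one corpus column at a time. Writing q for the query vector and c_j for column j of the corpus (notation introduced here for illustration):

```latex
\[
\operatorname{sim}(q, c_j)
  = \frac{q \cdot c_j}{\lVert q \rVert \, \lVert c_j \rVert}
  = \frac{\sum_i q_i \, c_{ij}}{\sqrt{\sum_i q_i^2} \, \sqrt{\sum_i c_{ij}^2}}
\]
```

The numerator corresponds to `c.T.dot(q)`, and the two square roots correspond to `(q ** 2).sum() ** .5` and `(c ** 2).sum(0) ** .5`.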
np.einsum
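The code for this variant is missing here. As a minimal sketch of how np.einsum could compute the same quantities (an assumption, not the answer's original code; the names dots, col_norms and q_norm are hypothetical):

```python
import numpy as np
import pandas as pd

words = ["stark", "groß", "schwach", "klein", "dick"]
corpus = pd.DataFrame(
    [[3, 1, 1, 1, 1, 60],
     [2, 2, 0, 2, 0, 20],
     [0, 2, 1, 1, 0, 0],
     [0, 0, 2, 1, 0, 1],
     [0, 0, 0, 0, 1, 0]],
    index=words,
    columns=["d1", "d2", "d3", "d4", "d5", "d6"],
)
query = pd.Series([1, 1, 0, 0, 0], index=words)

c = corpus.values.astype(float)
q = query.values.astype(float)

# "ij,i->j" sums over the rows i: dot product of q with every column j.
dots = np.einsum("ij,i->j", c, q)
# "ij,ij->j" sums c * c over the rows: squared norm of every column.
col_norms = np.sqrt(np.einsum("ij,ij->j", c, c))
# "i,i->" is the plain dot product of q with itself.
q_norm = np.sqrt(np.einsum("i,i->", q, q))

r = pd.Series(dots / (col_norms * q_norm), index=corpus.columns)
print(r)
```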
Answer 2 (score: 0)
The apply function is a tidy, readable, and fast way to do this kind of job:
import pandas as pd
from scipy.spatial.distance import cosine
corpus = pd.DataFrame([[3,1,1,1,1,60],[2,2,0,2,0,20], [0,2,1,1,0,0], [0,0,2,1,0,1],[0,0,0,0,1,0]], index=["stark","groß","schwach","klein", "dick"], columns=["d1", "d2", "d3","d4","d5","d6"])
query = pd.Series([1,1,0,0,0], index=["stark","groß","schwach","klein", "dick"])
corpus.apply(lambda x:1-cosine(query, x), # Apply your function
axis=0) # For each column
# d1    0.980581
# d2    0.707107
# d3    0.288675
# d4    0.801784
# d5    0.500000
# d6    0.894315
# dtype: float64