Python: how to speed up cosine similarity with counting arrays

Time: 2019-03-06 20:37:13

Tags: python arrays list scipy

I need to compute cosine similarity over a very large set. The set represents users, and each user is an array of object IDs. An example:

user_1 = [1,4,6,100,3,1]
user_2 = [4,7,8,3,3,2,200,9,100]

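If my understanding is correct, to compute cosine similarity I first need to create counting arrays so that every array has the same representation. Then I need to apply the cosine similarity function. By counting arrays I mean: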

#user_1 array
#                        1,2,3,4,5,6,[7-99],100,[101-200]
user_1_counting_array = [2,0,1,1,0,1,.......,1,.........]
user_2_counting_array = [0,1,2,1,0,0,1,1,1,.,1,.......,1]

(Here the dots stand for zeros.)

Once I have this common representation, I use SciPy's cosine distance function:

from scipy import spatial
s = 1 - spatial.distance.cosine(user_1_counting_array, user_2_counting_array)
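
For reference, here is a minimal pure-Python sketch of the quantity that `1 - spatial.distance.cosine(u, v)` corresponds to: the dot product divided by the product of the two Euclidean norms. The function name `cosine_similarity` is hypothetical, and returning 0.0 for an all-zero vector is an assumption made here for convenience, not SciPy's behavior.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length numeric sequences."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    # Assumption: define the similarity as 0.0 when either vector is all zeros.
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Dot product is 2 and both norms are sqrt(5), so the result is about 0.4.
example = cosine_similarity([2, 0, 1], [0, 1, 2])
```

Only positions where both vectors are non-zero contribute to the dot product, which is why the long runs of zeros in the counting arrays are wasted work.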

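The problem is that when I actually run the code, everything is very slow, and I have more than 1M users. I know there will be a lot of combinations, but I think the way I create the common representation is a major bottleneck.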

For completeness, here is my implementation:

from collections import Counter
from scipy import spatial

def fill_array(array, counter):
    for c in counter:
        array[c] = counter[c]
    return array

user_1 = [1,4,6,100,3,1]
user_2 = [4,7,8,3,3,2,200,9,100]

user_1_c = Counter(user_1)
user_2_c = Counter(user_2)

if max(user_1_c) > max(user_2_c):
    max_a = max(user_1_c)+1
else:
    max_a = max(user_2_c)+1

user_1_c_array = [0]*max_a
user_2_c_array = [0]*max_a

fill_array(user_1_c_array, user_1_c)
fill_array(user_2_c_array, user_2_c)

result = 1 - spatial.distance.cosine(user_1_c_array, user_2_c_array)

1 answer:

Answer 0 (score: 1)

Here is a short, concise way to build the vectors for cosine similarity without looping over a million entries:

from collections import Counter

user_1 = [1,4,6,100,3,1]
user_2 = [4,7,8,3,3,2,200,9,100]

# Create a list of unique elements
uniq = list(set(user_1 + user_2))

# Map all unique entries in user_1 and user_2 to zero
duniq = {k:0 for k in uniq}

def create_vector(duniq, l):
    dx = duniq.copy()
    dx.update(Counter(l)) # Count the values
    return list(dx.values()) # Return a list

u1 = create_vector(duniq, user_1)
u2 = create_vector(duniq, user_2)

# u1, u2:
# u1 = [2, 0, 1, 1, 1, 0, 0, 0, 0, 1]
# u2 = [0, 1, 2, 1, 0, 1, 1, 1, 1, 1]

You can then feed these two vectors into the cosine similarity calculation.
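
As a possible alternative to building any dense vector at all, the similarity can be computed directly from the two `Counter` objects, since only IDs present in both users contribute to the dot product. This is a sketch under that assumption, not the answerer's method; `cosine_from_counts` is a hypothetical helper name.

```python
from collections import Counter
from math import sqrt

def cosine_from_counts(c1, c2):
    """Cosine similarity of two Counters treated as sparse count vectors."""
    # Iterate over the smaller Counter; only shared IDs add to the dot product.
    if len(c1) > len(c2):
        c1, c2 = c2, c1
    dot = sum(count * c2[key] for key, count in c1.items() if key in c2)
    norm1 = sqrt(sum(v * v for v in c1.values()))
    norm2 = sqrt(sum(v * v for v in c2.values()))
    # Assumption: treat an empty user as having similarity 0.0.
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

user_1 = [1, 4, 6, 100, 3, 1]
user_2 = [4, 7, 8, 3, 3, 2, 200, 9, 100]

# Shared IDs are 3, 4 and 100, giving a dot product of 4; the squared norms are 8 and 11.
similarity = cosine_from_counts(Counter(user_1), Counter(user_2))
```

The cost per pair depends on the number of distinct IDs per user rather than on the largest ID. With many users, each user's `Counter` and norm could be computed once and reused across pairs.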