Question

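The input data set maps each user ID to the list of terms that user used. We first create a bag-of-words model using the following function: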

import pprint


def associate_terms_with_user(unique_term_set, all_users_terms_dict):
    """Build a binary bag-of-words vector ('0'/'1' flags) for each user."""

    # Fix the order of the global terms once, so that every user's vector
    # uses the same column for the same term.
    ordered_terms = list(unique_term_set)

    associated_value_return_dict = {}

    for user_id, user_terms in all_users_terms_dict.items():

        # what terms *did* this user use (a set makes the lookup cheap)
        terms_belong_to_this_user = set(user_terms)

        # '1' if the user used the term at least once, '0' otherwise;
        # this marks presence only, it does not count occurrences
        this_user_vector = [
            '1' if term in terms_belong_to_this_user else '0'
            for term in ordered_terms
        ]

        associated_value_return_dict[user_id] = this_user_vector

    pprint.pprint(associated_value_return_dict)
    return associated_value_return_dict

The output of the program is as follows:

{'007': ['0', '0', '1'],
 '666': ['0', '1', '1'],
 '888': ['1', '0', '0']}

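How can we implement a simple function that clusters these vectors based on how similar they are to one another? I envision using the approach described here, and possibly scikit-learn.

I have never done this before and I don't know how. I'm new to machine learning in general, and I don't even know where to start.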

In the end, 007 and 888 would probably be clustered together, and 666 would be left on its own, wouldn't it?

The full code is available here.

Answer 1

K-means is a good idea.
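To make the idea concrete, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm) applied to the '0'/'1' vectors from the question. The helper names (squared_distance, mean_vector, kmeans, cluster_users) are made up for this illustration and do not come from the question's code. The sketch also assumes the data set is tiny: it tries every combination of data points as starting centroids and keeps the run with the lowest within-cluster squared distance, which would be far too slow on real data.

```python
import itertools


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans(points, initial_centroids, max_iterations=100):
    """Run Lloyd's algorithm from the given centroids; return (labels, inertia)."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = mean_vector(members)
    inertia = sum(squared_distance(p, centroids[label])
                  for p, label in zip(points, labels))
    return labels, inertia


def cluster_users(user_vectors, n_clusters):
    """Group user IDs by k-means cluster; input is {user_id: ['0', '1', ...]}."""
    user_ids = sorted(user_vectors)
    # The bag-of-words function stores flags as strings, so convert to ints.
    points = [[int(flag) for flag in user_vectors[u]] for u in user_ids]

    best_labels, best_inertia = None, None
    # Assumption: the data set is small enough to try every starting choice.
    for start in itertools.combinations(points, n_clusters):
        labels, inertia = kmeans(points, start)
        if best_inertia is None or inertia < best_inertia:
            best_labels, best_inertia = labels, inertia

    clusters = {}
    for user_id, label in zip(user_ids, best_labels):
        clusters.setdefault(label, []).append(user_id)
    return sorted(clusters.values())


if __name__ == "__main__":
    vectors = {'007': ['0', '0', '1'],
               '666': ['0', '1', '1'],
               '888': ['1', '0', '0']}
    print(cluster_users(vectors, 2))
```

On the three vectors from the question with two clusters, this sketch groups 007 with 666, since they share a term, and leaves 888 on its own. K-means depends on its starting centroids, so a single run from an unlucky start can settle on a different grouping. For anything beyond a toy example, scikit-learn's sklearn.cluster.KMeans does the same job and handles initialisation and restarts for you.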

Some examples and code from the web:

1) Document clustering with Python link

2) Clustering text documents using scikit-learn kmeans in Python link

3) Clustering a long list of strings (words) into similarity groups link

4) Kaggle post link

Simple k-means clustering for bag-of-words models using Python
