Question

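I have to cluster some documents in JSON format. I would like to tinker with feature hashing to reduce the dimensionality. To start small, here is my input: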

doc_a = { "category": "election, law, politics, civil, government",
          "expertise": "political science, civics, republican"
        }

doc_b = { "category": "Computers, optimization",
          "expertise": "computer science, graphs, optimization"
        }
doc_c = { "category": "Election, voting",
          "expertise": "political science, republican"
        }
doc_d = { "category": "Engineering, Software, computers",
          "expertise": "computers, programming, optimization"
        }
doc_e = { "category": "International trade, politics",
          "expertise": "civics, political activist"
        }

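Now, how can I use feature hashing to create a vector for each document, then compute similarity and create clusters? I am a bit lost after reading http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html. I am not sure whether I have to use "dict", or convert my data to integers and then use 'pair' as the 'input_type' of my FeatureHasher. How should I interpret the output of FeatureHasher? For instance, the example at http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html outputs a numpy array.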

In [1]: from sklearn.feature_extraction import FeatureHasher

In [2]: hasher = FeatureHasher(n_features=10, non_negative=True, input_type='pair')

In [3]: x_new = hasher.fit_transform([[('a', 1), ('b', 2)], [('a', 0), ('c', 5)]])

In [4]: x_new.toarray()
Out[4]:
array([[ 1.,  2.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  5.,  0.,  0.]])

In [5]:

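I think the rows are documents and the column values are..? Say I want to cluster these vectors or find the similarity between them (using cosine or Jaccard): I am not sure whether I have to do an item-wise comparison.

Expected output: doc_a, doc_c and doc_e should be in one cluster, and the rest in another.

Thanks!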

Answer 1

You will make things easier for yourself if you use HashingVectorizer instead of FeatureHasher for this problem. HashingVectorizer takes care of tokenizing your input data and can accept a list of strings.
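As a minimal sketch of that point, HashingVectorizer can be given raw strings directly. The two strings are the category values of doc_a and doc_b from the question; n_features=16 is an arbitrary small value picked only for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Raw strings taken from the "category" field of doc_a and doc_b
categories = [
    "election, law, politics, civil, government",
    "Computers, optimization",
]

# n_features=16 is an illustrative assumption, not a recommended setting
vectorizer = HashingVectorizer(n_features=16)

# The vectorizer is stateless, so transform() works without a prior fit()
X = vectorizer.transform(categories)

# One row per document, one column per hash bucket
print(X.shape)
print(X.toarray())
```

Each row is a document and each column is a hash bucket, so a column does not map back to one specific word.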

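The main challenge with the problem is that you actually have two kinds of text features, category and expertise. The trick in this case is to fit a hashing vectorizer for both features and then combine the output.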
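The approach described above could be sketched roughly as follows. This is an illustration rather than a definitive implementation: the bucket count (n_features=64), the use of scipy.sparse.hstack to combine the two blocks, and KMeans with two clusters are all assumptions made for the example.

```python
from scipy.sparse import hstack
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# The five documents from the question, in order doc_a .. doc_e
docs = [
    {"category": "election, law, politics, civil, government",
     "expertise": "political science, civics, republican"},
    {"category": "Computers, optimization",
     "expertise": "computer science, graphs, optimization"},
    {"category": "Election, voting",
     "expertise": "political science, republican"},
    {"category": "Engineering, Software, computers",
     "expertise": "computers, programming, optimization"},
    {"category": "International trade, politics",
     "expertise": "civics, political activist"},
]

# HashingVectorizer is stateless, so one instance can hash both fields.
# n_features=64 is an illustrative assumption.
vectorizer = HashingVectorizer(n_features=64)
X_category = vectorizer.transform([d["category"] for d in docs])
X_expertise = vectorizer.transform([d["expertise"] for d in docs])

# Put the two feature blocks side by side: 64 + 64 columns per document
X = hstack([X_category, X_expertise]).tocsr()

# Pairwise cosine similarity between documents, computed on whole rows
similarities = cosine_similarity(X)

# Two clusters, as the question expects; KMeans is one possible choice
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(similarities.round(2))
print(labels)
```

Keeping the two fields in separate column blocks means a word in category never collides with the same word in expertise. Cluster assignments depend on the bucket count and on hash collisions, so it is worth checking them against the expected grouping.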
