Question

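OK, I'm in a bit of a dilemma. So far my script converts page titles into categories. This is based on keywords: when there is a match, a certain score is added, i.e. some words are worth 10 and some only 1. These accumulate into a total score for each category.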

[{15: [32, 'massages']}, {45: [12, 'hair-salon']}, {23: [3, 'automotive service']}]

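The key is the category ID; the first value is the score and the second value is the category name.

In some cases there are more than 10 category matches.

How can I filter this down to the top 60-75%?

Clearly massages and hair-salon are the most important, since they score far above automotive service. But how do we express that judgment programmatically?

I thought stddev might help?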

Edit

I'm trying to filter out the low-scoring items, for example:

data = [{15: [32, 'massages']}, {45: [1, 'hair-salon']}, {23: [1, 'automotive service']}]

Massages is the only high-scoring item in this instance.

data = [{15: [4, 'massages']}, {45: [2, 'hair-salon']}, {23: [1, 'automotive service']}]

Still massages.

data = [{15: [10, 'massages']}, {45: [50, 'hair-salon']}, {23: [5, 'automotive service']}]

Now hair-salon (because it is far higher than the others).

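So I don't need the first N objects. Rather, I want the top object when it is higher than the others by some amount x, expressed as a percentage or in terms of the standard deviation.

So 50 is far higher than 10 and 5.

10 is far higher than 3 or 2.

However, 9, 8 and 6 are roughly the same.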

Answer 1

Here is a solution using heapq.nlargest():

import heapq

data = [{15: [32, 'massages']}, {45: [12, 'hair-salon']}, {23:[3, 'automotive service']}]

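# Keep roughly the top 60% of the entries (always at least one).
N = int(len(data) * 0.6 + 1)
# Each entry is a one-item dict; rank entries by the score stored in its value.
print(heapq.nlargest(N, data, key=lambda x: next(iter(x.values()))[0]))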

Prints:

[{15: [32, 'massages']}, {45: [12, 'hair-salon']}]
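
Since the question mentions lists with more than 10 matches and a target of 60-75%, the same idea can be wrapped in a small helper. This is only a sketch: the name top_fraction, the use of math.ceil for rounding (the snippet above uses int(len * 0.6 + 1) instead), and the sample scores are all my own assumptions, not part of the question.

```python
import heapq
import math


def top_fraction(data, fraction=0.6):
    """Return the ceil(fraction * len(data)) highest-scoring entries.

    Each entry is assumed to be a one-item dict: {category_id: [score, name]}.
    """
    if not data:
        return []
    # Round up so that at least one entry always survives.
    n = max(1, math.ceil(len(data) * fraction))
    return heapq.nlargest(n, data, key=lambda d: next(iter(d.values()))[0])


# Hypothetical scores, only to show the effect on a longer list.
sample = [{i: [score, "category-%d" % i]}
          for i, score in enumerate([32, 12, 3, 7, 1, 20, 5, 9])]

for fraction in (0.6, 0.75):
    kept = top_fraction(sample, fraction)
    print(fraction, [next(iter(d.values()))[0] for d in kept])
```

With 8 entries this keeps 5 of them at 0.6 and 6 of them at 0.75, so the fraction is the only knob you need to tune.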

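Edit: if you want to eliminate the "low-scoring items", you need to define exactly what "low-scoring" means.

Here is some code that defines "low-scoring" in a completely arbitrary way: a score is low if it is more than one standard deviation below the maximum: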

import math

data = [{15: [32, 'massages']}, {45: [1, 'hair-salon']}, {23:[3, 'automotive service']}]

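# Flatten the one-item dicts into a plain list of scores.
scores = [score for d in data for cat, (score, name) in d.items()]
score_mean = sum(scores) / float(len(scores))
# Population standard deviation of the scores.
score_stdev = math.sqrt(sum((s - score_mean) ** 2 for s in scores) / float(len(scores)))

# Keep only entries scoring within one standard deviation of the maximum.
print([d for d in data if next(iter(d.values()))[0] > (max(scores) - score_stdev)])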

Prints:

[{15: [32, 'massages']}]
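
To check that rule against the three data sets from the question's edit, it can be wrapped in a function. This is a rough sketch: drop_low_scores and the multiplier k are names I made up, statistics.pstdev (the population standard deviation) stands in for the manual formula, and I use >= rather than > so that a list of identical scores is kept whole instead of being emptied.

```python
import statistics


def drop_low_scores(data, k=1.0):
    """Keep entries whose score is within k standard deviations of the maximum.

    Each entry is assumed to be a one-item dict: {category_id: [score, name]}.
    """
    if not data:
        return []
    scores = [next(iter(d.values()))[0] for d in data]
    cutoff = max(scores) - k * statistics.pstdev(scores)
    # >= keeps every entry when all scores tie (stdev is 0 in that case).
    return [d for d in data if next(iter(d.values()))[0] >= cutoff]


examples = [
    [{15: [32, 'massages']}, {45: [1, 'hair-salon']}, {23: [1, 'automotive service']}],
    [{15: [4, 'massages']}, {45: [2, 'hair-salon']}, {23: [1, 'automotive service']}],
    [{15: [10, 'massages']}, {45: [50, 'hair-salon']}, {23: [5, 'automotive service']}],
]

for data in examples:
    print(drop_low_scores(data))
```

On these inputs it keeps massages, massages and hair-salon respectively, which matches what the question asks for. Raising k loosens the cutoff and lets more categories through.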

Answer 2

yourdata = [{15: [32, 'massages']}, {45: [12, 'hair-salon']}, {23:[3, 'automotive service']}]

# transfer your data into a more usable format
data = [(score, cat, name) for dat in yourdata for cat, (score, name) in dat.items()]

# sort on descending score
data.sort(reverse=True)

# throw away the low-scoring items
data = data[:int(len(data)*0.6 + 1)]

Returns

[(32, 15, 'massages'), (12, 45, 'hair-salon')]

(the two highest-scoring items)
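
The question's edit asks for a cutoff relative to the top score rather than a fixed count. One possible reading of that, as a sketch only: keep every item whose score is at least some fraction of the highest score. This is a different technique from the slice above; the name near_top and the default ratio of 0.6 are arbitrary assumptions you would need to tune.

```python
def near_top(data, ratio=0.6):
    """Keep (score, cat, name) tuples scoring at least ratio * the top score."""
    if not data:
        return []
    cutoff = ratio * max(score for score, cat, name in data)
    # Sort descending so the strongest category comes first.
    return sorted((item for item in data if item[0] >= cutoff), reverse=True)


# Scores taken from the examples in the question; the IDs and names
# for the last case are placeholders.
print(near_top([(10, 15, 'massages'), (50, 45, 'hair-salon'), (5, 23, 'automotive service')]))
print(near_top([(4, 15, 'massages'), (2, 45, 'hair-salon'), (1, 23, 'automotive service')]))
print(near_top([(9, 1, 'a'), (8, 2, 'b'), (6, 3, 'c')]))
```

With this ratio, 50 stands alone against 10 and 5, 4 stands alone against 2 and 1, and 9, 8 and 6 are all kept because they are roughly the same.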

Python - finding the largest numbers in an array

2 answers: