Question

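I'm working with eBay and have a list of 100 sold-item prices. What I want to do is separate each floating-point price into groups, then count and rank those groups to determine the most common general price for the item, so that I can automate the pricing of my own items.

Initially, I thought of grouping the prices in $10 increments, but I realized this isn't a good way to group them, because prices vary greatly due to outliers, unrelated items, and so on.

If I have a list of prices like this: [90, 92, 95, 99, 1013, 1100], I would like the application to separate the values into: {nineties: 4, thousands: 2}

But I'm not sure how to tell Python to do this. Preferably, the simpler the snippet is to integrate into my code, the better!

Any help or suggestions would be appreciated!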

Answer 1

The technique you use depends on your notion of what a group is.

If the number of groups is known, use k-means with k == 2. For working code in pure Python, see this link:

from kmeans import k_means, assign_data

prices = [90, 92, 95, 99, 1013, 1100]
points = [(x,) for x in prices]
centroids = k_means(points, k=2)
labeled = assign_data(centroids, points)
for centroid, group in labeled.items():
    print('Group centered around:', centroid[0])
    print([x for (x,) in group])
    print()

Output:

Group centered around: 94.0
[90, 92, 95, 99]

Group centered around: 1056.5
[1013, 1100]
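
The kmeans module imported above comes from the linked page and is not part of the standard library. For readers without it, here is a minimal one-dimensional sketch of the same idea (Lloyd's algorithm). The function name kmeans_1d and the deterministic choice of starting centroids are assumptions made for illustration, not the linked implementation.

```python
def kmeans_1d(values, k=2, max_iter=100):
    """Cluster numbers with Lloyd's algorithm; returns [(centroid, members), ...]."""
    data = sorted(values)
    if not 1 <= k <= len(data):
        raise ValueError("k must be between 1 and len(values)")
    # Deterministic start (an assumption): spread centroids across the sorted data.
    if k == 1:
        centroids = [data[0]]
    else:
        centroids = [data[i * (len(data) - 1) // (k - 1)] for i in range(k)]
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for x in data:
            nearest = min(range(k), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # converged
        centroids = updated
    return list(zip(centroids, clusters))


groups = kmeans_1d([90, 92, 95, 99, 1013, 1100], k=2)
for centroid, members in groups:
    print(centroid, members)
```

With real listings, k-means can be sensitive to the starting centroids, so it may be worth trying a few values of k and comparing the groups.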

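Or, if a fixed maximum distance between elements defines the groupings, just sort and loop over the elements, checking the distance between neighbors to see whether a new group has started: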

max_gap = 100
prices.sort()
groups = []
last_price = prices[0] - (max_gap + 1)
for price in prices:
    if price - last_price > max_gap:
        groups.append([])
    groups[-1].append(price)
    last_price = price
print(groups)

Output:

[[90, 92, 95, 99], [1013, 1100]]
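
To get from the groups to a price, one option is to wrap the gap-based loop in a function and take the median of the largest group. This is only a sketch: the names group_by_gap and suggested_price are made up here, and using the median as the "general price" is an assumption, not something the question specifies.

```python
from statistics import median


def group_by_gap(values, max_gap):
    """Split sorted values into runs whose neighbors differ by at most max_gap."""
    groups = []
    for value in sorted(values):
        if groups and value - groups[-1][-1] <= max_gap:
            groups[-1].append(value)   # close enough: extend the current group
        else:
            groups.append([value])     # gap too large: start a new group
    return groups


prices = [90, 92, 95, 99, 1013, 1100]
groups = group_by_gap(prices, max_gap=100)

# Assumption: the most populated group represents the typical price.
largest = max(groups, key=len)
suggested_price = median(largest)
print(groups, suggested_price)
```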

Answer 2

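I think scatter plots are underrated for this sort of thing. I recommend plotting the distribution of prices, choosing threshold(s) that look right for your data, and then adding whatever descriptive statistics you want for each group.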

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Reproduce your data
prices = pd.DataFrame({'price': [90, 92, 95, 99, 1013, 1100]})

# Add an arbitrary second column so I have two columns for scatter plot
prices['label'] = 'price'

# jitter=True spreads your data points out horizontally, so you can see
# clearly how much data you have in each group (groups based on vertical space)
sns.stripplot(data=prices, x='label', y='price', jitter=True)
plt.show()

Any number between 200 and 1,000 separates your data nicely. I arbitrarily chose 200; you may choose different thresholds once you have more data.

# Add group labels, get average by group
prices['price group'] = pd.cut(prices['price'], bins=(0, 200, np.inf))
prices['group average'] = prices.groupby('price group', observed=True)['price'].transform('mean')

   price  label price group  group average
0     90  price    (0, 200]           94.0
1     92  price    (0, 200]           94.0
2     95  price    (0, 200]           94.0
3     99  price    (0, 200]           94.0
4   1013  price  (200, inf]         1056.5
5   1100  price  (200, inf]         1056.5
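
If you want a starting threshold without reading it off the plot, a rough stand-in is the midpoint of the widest gap between neighboring sorted prices. This is a sketch under that assumption, using only the standard library; the function name largest_gap_threshold is hypothetical, and it only ever finds a single split, so it does not replace looking at the distribution.

```python
def largest_gap_threshold(values):
    """Return the midpoint of the widest gap between neighboring sorted values."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values")
    # Pair each value with its successor and keep the pair that is furthest apart.
    low, high = max(zip(data, data[1:]), key=lambda pair: pair[1] - pair[0])
    return (low + high) / 2


price_list = [90, 92, 95, 99, 1013, 1100]
threshold = largest_gap_threshold(price_list)
below = [p for p in price_list if p <= threshold]
above = [p for p in price_list if p > threshold]
print(threshold, below, above)
```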

Answer 3

A naive approach that points in the right direction:

>>> from math import log10
>>> from collections import Counter
>>>
>>> def f(i):
...     x = 10**int(log10(i))  # largest of 1, 10, 100, etc. that is <= i
...     return i // x * x
...
>>> lst = [90, 92, 95, 99, 1013, 1100]
>>> c = Counter(map(f, lst))
>>> c
Counter({90: 4, 1000: 2})
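
As a small extension of the same idea, Counter.most_common can pull out the busiest bucket directly. The sketch below assumes every price is at least 1 (log10 is undefined for zero and negative numbers, and the rounding behaves differently below 1); the name leading_digit_bucket is made up for this example.

```python
from collections import Counter
from math import floor, log10


def leading_digit_bucket(price):
    """Round a price down to its leading digit, e.g. 95 -> 90, 1013 -> 1000."""
    if price < 1:
        raise ValueError("this sketch assumes prices of at least 1")
    magnitude = 10 ** floor(log10(price))  # largest power of 10 that is <= price
    return price // magnitude * magnitude


prices = [90, 92, 95, 99, 1013, 1100]
counts = Counter(leading_digit_bucket(p) for p in prices)

# most_common(1) returns a one-element list of (bucket, count) pairs.
bucket, count = counts.most_common(1)[0]
print(bucket, count)
```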

Answer 4

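If your bucket sizes are somewhat arbitrary (say, between 55 and 95 and between 300 and 366), you can use a binning approach to classify the values into bin ranges. The cutoffs for the bins can be anything you want, so long as they increase from left to right.

Suppose these bin values: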

bins=[0,100,1000,10000]

Then:

[0,100,1000,10000]
  ^                    bin 1 -- 0    <= x < 100
      ^                bin 2 -- 100  <= x < 1000
           ^           bin 3 -- 1000 <= x < 10000

You can do this with numpy digitize:

import numpy as np
bins=np.array([0.0,100,1000,10000])        
prices=np.array([90, 92, 95, 99, 1013, 1100])        
inds=np.digitize(prices,bins)

You can also do this in pure Python:

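bins=[0.0,100,1000,10000]
# list() so the pairs can be re-scanned for every price (zip is one-shot in Python 3)
tests=list(zip(bins, bins[1:]))
prices=[90, 92, 95, 99, 1013, 1100]
inds=[]
for price in prices:
    # bins are half-open, so the top edge itself is out of range
    if price<min(bins) or price>=max(bins):
        idx=-1
    else:
        for idx, test in enumerate(tests,1):
            if test[0]<= price < test[1]:
                break
    inds.append(idx)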
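
A more compact pure-Python variant, offered as a sketch, uses the standard library's bisect module. For increasing bin edges, bisect_right returns the index i with bins[i-1] <= price < bins[i], which is the same convention as np.digitize with its default right=False. Note that out-of-range prices come back as 0 or len(bins) here, not -1 as in the loop above.

```python
from bisect import bisect_right

bins = [0.0, 100, 1000, 10000]
prices = [90, 92, 95, 99, 1013, 1100]

# bisect_right finds the insertion point to the right of any equal edge,
# so a price sitting exactly on an edge falls into the bin that starts there.
inds = [bisect_right(bins, price) for price in prices]
print(inds)
```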

Then classify by bin (using the result of either approach above):

for i, e in enumerate(prices):
    print("{} <= {} < {} bin {}".format(bins[inds[i]-1],e,bins[inds[i]],inds[i]))

0.0 <= 90 < 100 bin 1
0.0 <= 92 < 100 bin 1
0.0 <= 95 < 100 bin 1
0.0 <= 99 < 100 bin 1
1000 <= 1013 < 10000 bin 3
1000 <= 1100 < 10000 bin 3

Then filter the values of interest (bin 1) from the outliers (bin 3):

>>> my_prices=[price for price, bin in zip(prices, inds) if bin==1]
>>> my_prices
[90, 92, 95, 99]
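
Hard-coding bin 1 works for this sample. To let the data pick the bin, one possibility is to count the bin indices and keep the most populated one. This is a sketch, and it assumes that "most common" is what you mean by the bin of interest; busiest_bin is a name introduced here.

```python
from collections import Counter

prices = [90, 92, 95, 99, 1013, 1100]
inds = [1, 1, 1, 1, 3, 3]  # bin index per price, as computed above

# Count how many prices landed in each bin and take the fullest one.
busiest_bin, count = Counter(inds).most_common(1)[0]
my_prices = [price for price, idx in zip(prices, inds) if idx == busiest_bin]
print(busiest_bin, count, my_prices)
```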

Grouping floats to find the most common general number
