在数组中对重复进行分组?

时间:2013-03-07 17:39:22

标签: python numpy scipy repeat

我正在寻找一个获得一维排序数组并返回的函数 具有两列的二维数组,第一列包含非重复的 项目和第二列包含项目的重复次数。马上 我的代码如下:

def priorsGrouper(priors):
    if priors.size==0:
        ret=priors;
    elif priors.size==1:
        ret=priors[0],1;
    else:
        ret=numpy.zeros((1,2));
        pointer1,pointer2=0,0;
        while(pointer1<priors.size):
            counter=0;
            while(pointer2<priors.size and priors[pointer2]==priors[pointer1]):
                counter+=1;
                pointer2+=1;
            ret=numpy.row_stack((ret,[priors[pointer1],pointer2-pointer1]))
            pointer1=pointer2;
    return ret;
print priorsGrouper(numpy.array([1,2,2,3]))    

我的输出如下:

[[ 0.  0.]
 [ 1.  1.]
 [ 2.  2.]
 [ 3.  1.]]

首先,我无法摆脱我的[0,0]。其次我想知道是否有 一个numpy或scipy函数,或者我可以吗?

感谢。

4 个答案:

答案 0 :(得分:4)

您可以使用np.unique获取x中的唯一值,以及索引数组(称为inverse)。 inverse可以被视为x中元素的“标签”。与x本身不同,标签始终为整数,从0开始。

然后你可以拿一个bincount标签。由于标签从0开始,因此bincount不会填充很多您不关心的零。

最后,column_stack会将y和bincount加入2D数组:

In [84]: x = np.array([1,2,2,3])

In [85]: y, inverse = np.unique(x, return_inverse=True)

In [86]: y
Out[86]: array([1, 2, 3])

In [87]: inverse
Out[87]: array([0, 1, 1, 2])

In [88]: np.bincount(inverse)
Out[88]: array([1, 2, 1])

In [89]: np.column_stack((y,np.bincount(inverse)))
Out[89]: 
array([[1, 1],
       [2, 2],
       [3, 1]])

有时当数组较小时,使用普通Python方法比NumPy函数更快。我想检查一下这是否是这种情况,如果是这样,那么在NumPy方法更快之前,x有多大。

以下是作为x大小函数的各种方法的性能图: enter image description here

In [173]: x = np.random.random(1000)

In [174]: x.sort()

In [156]: %timeit using_unique(x)
10000 loops, best of 3: 99.7 us per loop

In [180]: %timeit using_groupby(x)
100 loops, best of 3: 3.64 ms per loop

In [157]: %timeit using_counter(x)
100 loops, best of 3: 4.31 ms per loop

In [158]: %timeit using_ordered_dict(x)
100 loops, best of 3: 4.7 ms per loop

对于1000的len(x)using_unique比任何测试的普通Python方法快35倍。

所以看起来using_unique速度最快,即使是非常小的len(x)


以下是用于生成图表的程序:

import numpy as np
import collections
import itertools as IT
import matplotlib.pyplot as plt
import timeit

def using_unique(x):
    y, inverse = np.unique(x, return_inverse=True)
    return np.column_stack((y, np.bincount(inverse)))

def using_counter(x):
    result = collections.Counter(x)
    return np.array(sorted(result.items()))

def using_ordered_dict(x):
    result = collections.OrderedDict()
    for item in x:
        result[item] = result.get(item,0)+1
    return np.array(result.items())

def using_groupby(x):
    return np.array([(k, sum(1 for i in g)) for k, g in IT.groupby(x)])

fig, ax = plt.subplots()
timing = collections.defaultdict(list)
Ns = [int(round(n)) for n in np.logspace(0, 3, 10)]
for n in Ns:
    x = np.random.random(n)
    x.sort()
    timing['unique'].append(
        timeit.timeit('m.using_unique(m.x)', 'import __main__ as m', number=1000))
    timing['counter'].append(
        timeit.timeit('m.using_counter(m.x)', 'import __main__ as m', number=1000))
    timing['ordered_dict'].append(
        timeit.timeit('m.using_ordered_dict(m.x)', 'import __main__ as m', number=1000))
    timing['groupby'].append(
        timeit.timeit('m.using_groupby(m.x)', 'import __main__ as m', number=1000))

ax.plot(Ns, timing['unique'], label='using_unique')
ax.plot(Ns, timing['counter'], label='using_counter')
ax.plot(Ns, timing['ordered_dict'], label='using_ordered_dict')
ax.plot(Ns, timing['groupby'], label='using_groupby')
plt.legend(loc='best')
plt.ylabel('milliseconds')
plt.xlabel('size of x')
plt.show()

答案 1 :(得分:3)

如果订单不重要,请使用计数器。

from collections import Counter
% Counter([1,2,2,3])
= Counter({2: 2, 1: 1, 3: 1})
% Counter([1,2,2,3]).items()
[(1, 1), (2, 2), (3, 1)]

为了保留订单(首次出现),您可以实现自己的Counter版本:

from collections import OrderedDict
def OrderedCounter(seq):
     res = OrderedDict()
     for x in seq:
        res.setdefault(x, 0) 
        res[x] += 1
     return res
% OrderedCounter([1,2,2,3])
= OrderedDict([(1, 1), (2, 2), (3, 1)])
% OrderedCounter([1,2,2,3]).items()
= [(1, 1), (2, 2), (3, 1)]

答案 2 :(得分:1)

如果要计算项目的重复次数,可以使用字典:

l = [1, 2, 2, 3]
d = {}
for i in l:
    if i not in d:
        d[i] = 1
    else:
        d[i] += 1
result = [[k, v] for k, v in d.items()]

对于您的示例返回:

[[1, 1],
 [2, 2], 
 [3, 1]]
祝你好运。

答案 3 :(得分:0)

首先,您不需要以分号(;)结束语句,这不是C.: - )

其次,第5行(和其他人)将ret设置为value,value,但这不是列表:

>type foo.py
def foo():
        return [1],2
a,b = foo()
print "a = {0}".format(a)
print "b = {0}".format(b)

给出:

>python foo.py
a = [1]
b = 2

第三:有更简单的方法可以做到这一点,这就是我想到的:

  • 使用Set构造函数创建唯一的项目列表
  • 创建一个列表,列出Set中每个条目出现在输入字符串
  • 中的次数
  • 使用zip()组合并将两个列表作为元组集返回(尽管这不是您要求的)

这是一种方式:

def priorsGrouper(priors):
    """Find out how many times each element occurs in a list.

    @param[in] priors List of elements
    @return Two-dimensional list: first row is the unique elements,
                second row is the number of occurrences of each element.
    """

    # Generate a `list' containing only unique elements from the input
    mySet = set(priors)

    # Create the list that will store the number of occurrences
    occurrenceCounts = []

    # Count how many times each element occurs on the input:
    for element in mySet:
        occurrenceCounts.append(priors.count(element))

    # Combine the two:
    combinedArray = zip(mySet, occurrenceCounts)
# End of priorsGrouper() ----------------------------------------------

# Check zero-element case
print priorsGrouper([])

# Check multi-element case
sampleInput = ['a','a', 'b', 'c', 'c', 'c']
print priorsGrouper(sampleInput)