Question

我正在寻找一个获得一维排序数组并返回的函数具有两列的二维数组，第一列包含非重复的项目和第二列包含项目的重复次数。马上我的代码如下：

def priorsGrouper(priors):
    if priors.size==0:
        ret=priors;
    elif priors.size==1:
        ret=priors[0],1;
    else:
        ret=numpy.zeros((1,2));
        pointer1,pointer2=0,0;
        while(pointer1<priors.size):
            counter=0;
            while(pointer2<priors.size and priors[pointer2]==priors[pointer1]):
                counter+=1;
                pointer2+=1;
            ret=numpy.row_stack((ret,[priors[pointer1],pointer2-pointer1]))
            pointer1=pointer2;
    return ret;
print priorsGrouper(numpy.array([1,2,2,3]))

我的输出如下：

[[ 0.  0.]
 [ 1.  1.]
 [ 2.  2.]
 [ 3.  1.]]

首先，我无法摆脱我的[0,0]。其次我想知道是否有一个numpy或scipy函数，或者我可以吗？

感谢。

Answer 1

您可以使用np.unique获取x中的唯一值，以及索引数组（称为inverse）。 inverse可以被视为x中元素的“标签”。与x本身不同，标签始终为整数，从0开始。

然后你可以拿一个bincount标签。由于标签从0开始，因此bincount不会填充很多您不关心的零。

最后，column_stack会将y和bincount加入2D数组：

In [84]: x = np.array([1,2,2,3])

In [85]: y, inverse = np.unique(x, return_inverse=True)

In [86]: y
Out[86]: array([1, 2, 3])

In [87]: inverse
Out[87]: array([0, 1, 1, 2])

In [88]: np.bincount(inverse)
Out[88]: array([1, 2, 1])

In [89]: np.column_stack((y,np.bincount(inverse)))
Out[89]: 
array([[1, 1],
       [2, 2],
       [3, 1]])

有时当数组较小时，使用普通Python方法比NumPy函数更快。我想检查一下这是否是这种情况，如果是这样，那么在NumPy方法更快之前，x有多大。

以下是作为x大小函数的各种方法的性能图： enter image description here

In [173]: x = np.random.random(1000)

In [174]: x.sort()

In [156]: %timeit using_unique(x)
10000 loops, best of 3: 99.7 us per loop

In [180]: %timeit using_groupby(x)
100 loops, best of 3: 3.64 ms per loop

In [157]: %timeit using_counter(x)
100 loops, best of 3: 4.31 ms per loop

In [158]: %timeit using_ordered_dict(x)
100 loops, best of 3: 4.7 ms per loop

对于1000的len(x)，using_unique比任何测试的普通Python方法快35倍。

所以看起来using_unique速度最快，即使是非常小的len(x)。

以下是用于生成图表的程序：

import numpy as np
import collections
import itertools as IT
import matplotlib.pyplot as plt
import timeit

def using_unique(x):
    y, inverse = np.unique(x, return_inverse=True)
    return np.column_stack((y, np.bincount(inverse)))

def using_counter(x):
    result = collections.Counter(x)
    return np.array(sorted(result.items()))

def using_ordered_dict(x):
    result = collections.OrderedDict()
    for item in x:
        result[item] = result.get(item,0)+1
    return np.array(result.items())

def using_groupby(x):
    return np.array([(k, sum(1 for i in g)) for k, g in IT.groupby(x)])

fig, ax = plt.subplots()
timing = collections.defaultdict(list)
Ns = [int(round(n)) for n in np.logspace(0, 3, 10)]
for n in Ns:
    x = np.random.random(n)
    x.sort()
    timing['unique'].append(
        timeit.timeit('m.using_unique(m.x)', 'import __main__ as m', number=1000))
    timing['counter'].append(
        timeit.timeit('m.using_counter(m.x)', 'import __main__ as m', number=1000))
    timing['ordered_dict'].append(
        timeit.timeit('m.using_ordered_dict(m.x)', 'import __main__ as m', number=1000))
    timing['groupby'].append(
        timeit.timeit('m.using_groupby(m.x)', 'import __main__ as m', number=1000))

ax.plot(Ns, timing['unique'], label='using_unique')
ax.plot(Ns, timing['counter'], label='using_counter')
ax.plot(Ns, timing['ordered_dict'], label='using_ordered_dict')
ax.plot(Ns, timing['groupby'], label='using_groupby')
plt.legend(loc='best')
plt.ylabel('milliseconds')
plt.xlabel('size of x')
plt.show()

Answer 2

如果订单不重要，请使用计数器。

from collections import Counter
% Counter([1,2,2,3])
= Counter({2: 2, 1: 1, 3: 1})
% Counter([1,2,2,3]).items()
[(1, 1), (2, 2), (3, 1)]

为了保留订单（首次出现），您可以实现自己的Counter版本：

from collections import OrderedDict
def OrderedCounter(seq):
     res = OrderedDict()
     for x in seq:
        res.setdefault(x, 0) 
        res[x] += 1
     return res
% OrderedCounter([1,2,2,3])
= OrderedDict([(1, 1), (2, 2), (3, 1)])
% OrderedCounter([1,2,2,3]).items()
= [(1, 1), (2, 2), (3, 1)]

Answer 3

如果要计算项目的重复次数，可以使用字典：

l = [1, 2, 2, 3]
d = {}
for i in l:
    if i not in d:
        d[i] = 1
    else:
        d[i] += 1
result = [[k, v] for k, v in d.items()]

对于您的示例返回：

[[1, 1],
 [2, 2], 
 [3, 1]]

祝你好运。

Answer 4

首先，您不需要以分号（;）结束语句，这不是C.： - ）

其次，第5行（和其他人）将ret设置为value,value，但这不是列表：

>type foo.py
def foo():
        return [1],2
a,b = foo()
print "a = {0}".format(a)
print "b = {0}".format(b)

给出：

>python foo.py
a = [1]
b = 2

第三：有更简单的方法可以做到这一点，这就是我想到的：

使用Set构造函数创建唯一的项目列表
创建一个列表，列出Set中每个条目出现在输入字符串
使用zip（）组合并将两个列表作为元组集返回（尽管这不是您要求的）

这是一种方式：

def priorsGrouper(priors):
    """Find out how many times each element occurs in a list.

    @param[in] priors List of elements
    @return Two-dimensional list: first row is the unique elements,
                second row is the number of occurrences of each element.
    """

    # Generate a `list' containing only unique elements from the input
    mySet = set(priors)

    # Create the list that will store the number of occurrences
    occurrenceCounts = []

    # Count how many times each element occurs on the input:
    for element in mySet:
        occurrenceCounts.append(priors.count(element))

    # Combine the two:
    combinedArray = zip(mySet, occurrenceCounts)
# End of priorsGrouper() ----------------------------------------------

# Check zero-element case
print priorsGrouper([])

# Check multi-element case
sampleInput = ['a','a', 'b', 'c', 'c', 'c']
print priorsGrouper(sampleInput)

在数组中对重复进行分组？

4 个答案: