Question

Say I have a NumPy array like this:

import numpy as np

x= np.array(
    [[100, 14, 12, 15],
    [100, 21, 16, 11],
    [100, 19, 10, 13],
    [160, 24, 15, 12],
    [160, 43, 12, 65],
    [160, 17, 53, 23],
    [300, 15, 17, 11],
    [300, 66, 23, 12],
    [300, 44, 70, 19]])

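The original array is much larger, so my question is: is it possible to partition or group the rows by the value in the first column? For example: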

{'100': [[14, 12, 15], [21, 16, 11], [19, 10, 13]],
 '160': [[24, 15, 12], [43, 12, 65], [17, 53, 23]],
 '300': [[15, 17, 11], [66, 23, 12], [44, 70, 19]]}

Answer 1

We are talking about a large dataset, so performance probably matters, and the input is already a NumPy array. Two NumPy approaches are listed in this post.

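Approach #1

Here's one approach that uses np.unique to get the row indices where each group starts, then a dict comprehension to build the dictionary output: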

unq, idx = np.unique(x[:,0], return_index=1)
idx1 = np.r_[idx,x.shape[0]]
dict_out = {unq[i]:x[idx1[i]:idx1[i+1],1:] for i in range(len(unq))}

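This assumes the first column is sorted, as stated in the question title ("...repeated value in one column"). If that is not the case, we need to sort the rows of x with x[:,0].argsort() first and then proceed.

Sample input and output: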
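For the unsorted case, the sorting step can be sketched as follows. This is a minimal illustration, not part of the original answer: the array x_unsorted is made-up sample data, and kind="stable" is an added choice so that rows sharing a key keep their original relative order.

```python
import numpy as np

# Hypothetical unsorted input: same kind of rows as in the question, shuffled.
x_unsorted = np.array(
    [[160, 24, 15, 12],
     [100, 14, 12, 15],
     [300, 15, 17, 11],
     [100, 21, 16, 11],
     [160, 43, 12, 65]])

# A stable sort keeps rows with equal keys in their original relative order.
order = x_unsorted[:, 0].argsort(kind="stable")
x_sorted = x_unsorted[order]

# Approach #1 applied to the sorted copy.
unq, idx = np.unique(x_sorted[:, 0], return_index=True)
idx1 = np.r_[idx, x_sorted.shape[0]]
grouped = {int(unq[i]): x_sorted[idx1[i]:idx1[i + 1], 1:]
           for i in range(len(unq))}
```

Indexing with order makes a sorted copy, so the values in grouped are views into x_sorted rather than into the original array.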

In [41]: x
Out[41]: 
array([[100,  14,  12,  15],
       [100,  21,  16,  11],
       [100,  19,  10,  13],
       [160,  24,  15,  12],
       [160,  43,  12,  65],
       [160,  17,  53,  23],
       [300,  15,  17,  11],
       [300,  66,  23,  12],
       [300,  44,  70,  19]])

In [42]: dict_out
Out[42]: 
{100: array([[14, 12, 15],
        [21, 16, 11],
        [19, 10, 13]]), 160: array([[24, 15, 12],
        [43, 12, 65],
        [17, 53, 23]]), 300: array([[15, 17, 11],
        [66, 23, 12],
        [44, 70, 19]])}

Approach #2

Here's another approach that gets rid of np.unique for a further performance boost:

idx1 = np.concatenate(([0],np.flatnonzero(x[1:,0] != x[:-1,0])+1, [x.shape[0]]))
dict_out = {x[i,0]:x[i:j,1:] for i,j in zip(idx1[:-1], idx1[1:])}
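To see what the boundary computation in Approach #2 produces, here is a small worked sketch on a made-up key column (the names keys, starts, bounds and spans are illustrative only). It assumes the keys are already sorted, or at least contiguous, and that there is at least one row.

```python
import numpy as np

# Hypothetical sorted key column (the first column of x).
keys = np.array([100, 100, 100, 160, 160, 300])

# Comparing each key with its predecessor is True where a new group begins;
# the +1 converts positions in the shifted array back to row indices.
starts = np.flatnonzero(keys[1:] != keys[:-1]) + 1

# Prepend 0 and append the row count to get every group boundary.
bounds = np.concatenate(([0], starts, [len(keys)]))

# Consecutive boundary pairs are the (start, stop) slices of each group.
spans = list(zip(bounds[:-1].tolist(), bounds[1:].tolist()))
print(spans)  # [(0, 3), (3, 5), (5, 6)]
```

Each (start, stop) pair is then used as x[start:stop, 1:], which is a basic slice and therefore a view rather than a copy.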

Runtime test

Approaches:

# @COLDSPEED's soln
from collections import defaultdict
def defaultdict_app(x):
    data = defaultdict(list)
    for l in x:
        data[l[0]].append(l[1:])
    return data  # return the grouped result, like the other functions do

# @David Z's soln-1
import pandas as pd
def pandas_groupby_app(x):
    df = pd.DataFrame(x)
    return {key: group.iloc[:,1:] for key, group in df.groupby(0)}

# @David Z's soln-2
import itertools as it
def groupby_app(x):
    return {key: list(map(list, group)) for key, group in \
                        it.groupby(x, lambda row: row[0])}

# Proposed in this post    
def numpy_app1(x):
    unq, idx = np.unique(x[:,0], return_index=1)
    idx1 = np.r_[idx,x.shape[0]]
    return {unq[i]:x[idx1[i]:idx1[i+1],1:] for i in range(len(unq))}

# Proposed in this post    
def numpy_app2(x):
    idx1 = np.concatenate(([0],np.flatnonzero(x[1:,0] != x[:-1,0])+1, [x.shape[0]]))
    return {x[i,0]:x[i:j,1:] for i,j in zip(idx1[:-1], idx1[1:])}

Timings:

In [84]: x = np.random.randint(0,100,(10000,4))

In [85]: x[:,0].sort()

In [86]: %timeit defaultdict_app(x)
    ...: %timeit pandas_groupby_app(x)
    ...: %timeit groupby_app(x)
    ...: %timeit numpy_app1(x)
    ...: %timeit numpy_app2(x)
    ...: 
100 loops, best of 3: 4.43 ms per loop
100 loops, best of 3: 15 ms per loop
100 loops, best of 3: 12.1 ms per loop
1000 loops, best of 3: 310 µs per loop
10000 loops, best of 3: 75.6 µs per loop

Answer 2

Since you tagged this pandas, you might want to do this with the groupby() functionality of DataFrame. You would create a DataFrame from the original array

import pandas as pd
df = pd.DataFrame(x)

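and group by the first column; you can then iterate over the resulting GroupBy object to get groups of frames that share the same value in the first column.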

{key: group for key, group in df.groupby(0)}

Of course, in this snippet group still includes the first column. You can remove it by indexing:

{key: group.iloc[:,1:] for key, group in df.groupby(0)}

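If you want to convert the sub-frames back to NumPy arrays, use group.iloc[:,1:].values instead. (If you want them as lists of lists, as in your question, it shouldn't be hard to write a function to do the conversion, but it will probably be more efficient to keep the data in Pandas, or at least NumPy, if you can.)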
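The list-of-lists conversion mentioned above might look like the sketch below. The shortened array and the name lists_by_key are illustrative; to_numpy() is used here as the currently documented equivalent of .values, and int(key) is an added step so the dictionary keys are plain Python integers.

```python
import numpy as np
import pandas as pd

# Shortened sample of the array from the question.
x = np.array([[100, 14, 12, 15],
              [100, 21, 16, 11],
              [160, 24, 15, 12]])

df = pd.DataFrame(x)

# Drop the key column from each group, then convert the sub-frame
# to a NumPy array and on to nested Python lists.
lists_by_key = {
    int(key): group.iloc[:, 1:].to_numpy().tolist()
    for key, group in df.groupby(0)
}
```

The tolist() call copies every value into Python objects, so on a large array this gives up much of the speed advantage of staying in NumPy or Pandas.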

Another approach is to use the original groupby() from itertools, which lets you avoid Pandas (if you have some reason to) and use a simple iterative method.

import itertools as it
{key: list(map(list, group))
    for key, group in it.groupby(x, lambda row: row[0])}

This again includes the key in the resulting rows, but you can trim it off by indexing:

{key: list(map(lambda a: list(a)[1:], group))
    for key, group in it.groupby(x, lambda row: row[0])}

You can make the code a bit clearer with groupby_transform() from more-itertools (which is not included in the standard Python library):

import more_itertools as mt
{key: list(group) for key, group in mt.groupby_transform(
    x, lambda row: row[0], lambda row: list(row[1:])
)}

(Disclosure: I contributed the groupby_transform() function to more-itertools.)
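One caveat for the itertools.groupby snippets in this answer: groupby only merges consecutive rows with equal keys, so if the key column is not sorted the same key appears several times and the dict comprehension keeps only the last run. A minimal sketch of sorting first, using made-up rows held as plain lists:

```python
import itertools as it
from operator import itemgetter

# Hypothetical rows whose keys (first element) are not contiguous.
rows = [[160, 24, 15, 12],
        [100, 14, 12, 15],
        [160, 43, 12, 65],
        [100, 21, 16, 11]]

first = itemgetter(0)

# sorted() is stable, so rows with the same key keep their original order.
rows_sorted = sorted(rows, key=first)

# Group the now-contiguous keys and drop the key from each row.
grouped = {key: [row[1:] for row in group]
           for key, group in it.groupby(rows_sorted, key=first)}
```

For a NumPy array, sorting the rows with argsort before calling groupby achieves the same thing.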

Answer 3

You can group the data using collections.defaultdict and a loop.

from collections import defaultdict

data = defaultdict(list)
for l in x.tolist():  # iterate over plain lists so the printed result matches the output below
    data[l[0]].append(l[1:])

print(dict(data))

Output:

{100: [[14, 12, 15], [21, 16, 11], [19, 10, 13]],
 160: [[24, 15, 12], [43, 12, 65], [17, 53, 23]],
 300: [[15, 17, 11], [66, 23, 12], [44, 70, 19]]}
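The target in the question uses string keys ('100', '160', ...). If that exact shape is needed, one possible variation is to convert the key with str() while grouping. This is a sketch on a shortened copy of the sample array; the name result is illustrative.

```python
from collections import defaultdict

import numpy as np

# Shortened sample of the array from the question.
x = np.array([[100, 14, 12, 15],
              [100, 21, 16, 11],
              [160, 24, 15, 12]])

data = defaultdict(list)
for row in x.tolist():        # plain Python lists and ints
    data[str(row[0])].append(row[1:])

result = dict(data)
print(result)  # {'100': [[14, 12, 15], [21, 16, 11]], '160': [[24, 15, 12]]}
```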

Answer 4

I think you want something like this.

Edited:

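ls_dict = {}
for ls in x.tolist():  # plain lists, so the printed result matches the output below
    key = ls[0]
    value = ls[1:]
    if key in ls_dict:
        # key already seen: add this row to its group
        ls_dict[key].append(value)
    else:
        # first row for this key: start a new group
        ls_dict[key] = [value]
print(ls_dict)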

Output

{100: [[14, 12, 15], [21, 16, 11], [19, 10, 13]], 160: [[24, 15, 12], [43, 12, 65], [17, 53, 23]], 300: [[15, 17, 11], [66, 23, 12], [44, 70, 19]]}

Python: How to bin a set of data by a repeated value in one column
