Question

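In my real problem, I will have two tables of information (x, y). x will have about 2.6 million records and y will have ~10K; the two tables have a many-to-one (x -> y) relationship. I want to subset x based on y.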

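The posts I found that matched best were this and that and also this. I settled on numpy arrays. I am open to other data structures; I just want to pick something that will scale. Am I using an appropriate approach? Are there other posts that cover this? I would rather not use a database, since I am only doing this once.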

The following code tries to illustrate what I am attempting to do.

import numpy, copy
x=numpy.array([(1,'a'), (1, 'b'), (3,'a'), (3, 'b'), (3, 'c'), (4, 'd')], dtype=[('id', int),('category', str, 22)]  )
y=numpy.array([('a', 3.2, 0), ('b', -1, 0), ('c', 0, 0), ('d', 100, 0)], dtype=[('category', str, 20), ('value', float), ('output', int)] )
for id, category in x:
    if y[y['category']==category]['value'][0] > 3:
        y[y['category']==category]['output']=numpy.array(copy.deepcopy(id))

Answer 1

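You have to be careful when you index with a boolean array (y['category']==category) in order to modify the original array (y), because "fancy indexing" returns a copy (not a view), so modifying that copy does not change the original array y. If you do this on a plain array it works fine (this confused me in the past). But with the structured arrays you are using, the result is not a view even on the left-hand side of an assignment if you index with the mask first and then with the field name. That sounds confusing, but the upshot is that the code as you wrote it will not work. Note that y is unchanged before and after: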

for i, category in x:
    c = y['category']==category   #generate the mask once
    if y[c]['value'][0] > 3:
        print('before:', y[c]['output'])
        y[c]['output'] = i        #mask first, then field: writes to a temporary copy
        print('after:', y[c]['output'])

#output:
#before: [0]
#after: [0]
#before: [0]
#after: [0]
#before: [0]
#after: [0]

If you instead use field access first to get a view, and then apply fancy indexing to that view, you get a setitem call that works:

for i, category in x:
    c = y['category']==category   #generate the mask once
    if y[c]['value'][0] > 3:
        print('before:', y[c]['output'])
        y['output'][c] = i        #field first (a view), then mask: writes into y
        print('after:', y[c]['output'])

#output:
#before: [0]
#after: [1]
#before: [1]
#after: [3]
#before: [0]
#after: [4]
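
The difference between the two indexing orders can be seen in isolation. The following is a minimal sketch, not taken from the question: the arrays `plain` and `rec` and the helper name `unchanged` are made up for illustration, and the behavior shown assumes a reasonably recent NumPy.

```python
import numpy as np

# Plain array: a boolean-mask assignment is a single __setitem__ call,
# so it modifies the array in place.
plain = np.array([0, 0, 0])
plain[np.array([True, False, True])] = 7

# Structured array (hypothetical two-row stand-in for y).
rec = np.array([('a', 0), ('b', 0)],
               dtype=[('category', 'U20'), ('output', int)])
mask = rec['category'] == 'a'

# Mask first, then field: rec[mask] is a temporary copy,
# so the assignment never reaches rec.
rec[mask]['output'] = 5
unchanged = rec['output'].copy()

# Field first (a view into rec), then mask: the assignment reaches rec.
rec['output'][mask] = 5

print(plain)
print(unchanged)
print(rec['output'])
```

The rule of thumb this illustrates: put the field name first and the mask last when assigning into a structured array.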

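As you can see, I also removed your copy. i (or id, which I avoided because id is a built-in function) is just an integer and does not need to be copied. If you really do need to copy something, it is better to use numpy's copy than the standard library copy, e.g.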

y[...]['output'] = np.array(id, copy=True)

or

y[...]['output'] = np.copy(id)

In fact, copy=True is the default for np.array, so ... = np.array(id) should be enough, but I am no authority on copying.
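
The copy behavior is easy to check. A minimal sketch, with `src`, `dup`, and `same` as made-up names for illustration:

```python
import numpy as np

src = np.arange(3)

# np.array copies its input by default (copy=True).
dup = np.array(src)
dup[0] = 99                      # does not touch src
independent = not np.shares_memory(src, dup)

# np.asarray, by contrast, returns the input itself when it is
# already an ndarray of the requested dtype.
same = np.asarray(src)

print(src, dup, independent, same is src)
```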

Answer 2

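You have 2.6 million records, each of which (potentially) overwrites one of the 10K records. So there may be a lot of overwriting. Every time you write to the same location, all the previous work done at that location was for nothing.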

So you can be more efficient by looping over y (10K unique? categories) rather than looping over x (2.6M records).

import numpy as np
x = np.array([(1,'a'), (1, 'b'), (3,'a'), (3, 'b'), (3, 'c'), (4, 'd')], dtype=[('id', int),('category', str, 22)]  )
y = np.array([('a', 3.2, 0), ('b', -1, 0), ('c', 0, 0), ('d', 100, 0)], dtype=[('category', str, 20), ('value', float), ('output', int)] )

for idx in np.where(y['value'] > 3)[0]:
    row = y[idx]
    category = row['category']
    # Only the last record in `x` of the right category affects `y`.
    # So find the id value for that last record in `x`
    idval = x[x['category'] == category]['id'][-1]
    y[idx]['output'] = idval

print(y)

yields

[('a', 3.2, 3) ('b', -1.0, 0) ('c', 0.0, 0) ('d', 100.0, 4)]
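
The loop over y still rescans all of x once per qualifying category. If that turns out to be too slow at full size, one possible alternative is to drop the Python loop entirely. The following is only a sketch under assumptions: it swaps in np.unique and np.searchsorted, which the answer above does not use, and the names `cats`, `last_idx`, `pos`, `found`, and `update` are made up for illustration. It assumes, as above, that the last matching record in x is the one that should win.

```python
import numpy as np

x = np.array([(1, 'a'), (1, 'b'), (3, 'a'), (3, 'b'), (3, 'c'), (4, 'd')],
             dtype=[('id', int), ('category', 'U22')])
y = np.array([('a', 3.2, 0), ('b', -1, 0), ('c', 0, 0), ('d', 100, 0)],
             dtype=[('category', 'U20'), ('value', float), ('output', int)])

# np.unique reports the FIRST occurrence of each value, so run it on the
# reversed column and convert back to get the LAST occurrence in x.
cats, first_in_reversed = np.unique(x['category'][::-1], return_index=True)
last_idx = len(x) - 1 - first_in_reversed

# Look up each y category in the sorted unique categories of x.
y_cats = y['category'].astype(cats.dtype)
pos = np.clip(np.searchsorted(cats, y_cats), 0, len(cats) - 1)
found = cats[pos] == y_cats          # False for categories absent from x

# Update only rows that pass the threshold and have a match in x.
update = found & (y['value'] > 3)
y['output'][update] = x['id'][last_idx[pos[update]]]

print(y['output'])
```

On the toy data this gives the same `output` column as the loop above (3, 0, 0, 4).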

Selecting records in a Python structure based on a related structure
