Question

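I am processing a large amount of data in Python and looping over a very large list of class instances. Obviously this takes forever, and I'm coming to realize that the best solution is to vectorize the list with a numpy array. However, I haven't been able to find a way to convert the list of objects into the array I need.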

Suppose I have a list of 5 instances of a "Sentence" class, and the attributes of each instance in the list look like this:

{
    text: "I liked this phone.",
    rating: 5.0,
    positive: True
}

Is it possible to convert this into a 5x3 numpy array, where row[0] of each row gives me the object's text?

Answer 1

So your objects look like this dictionary:

In [49]: dd = {
    ...:     'text': "I liked this phone.",
    ...:     'rating': 5.0,
    ...:     'positive': True
    ...: }

With numpy imported as np, I can create an object dtype array that holds 5 copies of that dictionary (or of a similar object):

In [50]: arrO = np.empty((5,), object)
In [51]: dict(dd)
Out[51]: {'text': 'I liked this phone.', 'rating': 5.0, 'positive': True}
In [52]: for i in range(5):
    ...:     arrO[i] = dict(dd)
    ...:     
In [53]: arrO
Out[53]: 
array([{'text': 'I liked this phone.', 'rating': 5.0, 'positive': True},
       {'text': 'I liked this phone.', 'rating': 5.0, 'positive': True},
       {'text': 'I liked this phone.', 'rating': 5.0, 'positive': True},
       {'text': 'I liked this phone.', 'rating': 5.0, 'positive': True},
       {'text': 'I liked this phone.', 'rating': 5.0, 'positive': True}],
      dtype=object)

But an object array like this is much like a list: both hold pointers to objects stored elsewhere in memory:

In [54]: [dict(dd) for _ in range(5)]
Out[54]: 
[{'text': 'I liked this phone.', 'rating': 5.0, 'positive': True},
 {'text': 'I liked this phone.', 'rating': 5.0, 'positive': True},
 {'text': 'I liked this phone.', 'rating': 5.0, 'positive': True},
 {'text': 'I liked this phone.', 'rating': 5.0, 'positive': True},
 {'text': 'I liked this phone.', 'rating': 5.0, 'positive': True}]

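Iterating over a list is faster. Most operations on an object array involve iteration; the exceptions are operations such as reshape, which do not need to access the individual elements.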
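To illustrate that point, here is a minimal sketch (the names arr_obj and grid are mine, not part of the session above). reshape only rearranges the stored references, while pulling a value out of every element still needs a Python-level loop:

```python
import numpy as np

# Same illustrative record as in the session above.
dd = {'text': "I liked this phone.", 'rating': 5.0, 'positive': True}

arr_obj = np.empty((6,), dtype=object)
for i in range(arr_obj.size):
    arr_obj[i] = dict(dd)

# reshape rearranges the references; no dictionary is copied or inspected.
grid = arr_obj.reshape(2, 3)
same_object = grid[0, 0] is arr_obj[0]

# Extracting one key still means looping over every element in Python.
ratings = np.array([d['rating'] for d in arr_obj])
```

So the object array gains multidimensional indexing, but not the speed of a numeric array.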

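Another option is to make a structured array.

The key to making a structured array is to define a compound dtype and to provide the data as a list of tuples.

From Python 3.6 on, dictionaries keep insertion order (the language guarantees it from 3.7), so values gives the fields in the order we need: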

In [55]: tuple(dd.values())
Out[55]: ('I liked this phone.', 5.0, True)

In [56]: dt = np.dtype([('text','U30'),('rating',float),('positive',bool)])
In [57]: dt
Out[57]: dtype([('text', '<U30'), ('rating', '<f8'), ('positive', '?')])

Build the array from a list of tuples:

In [58]: arrS = np.array([tuple(dd.values()) for _ in range(5)],dtype=dt)
In [59]: arrS
Out[59]: 
array([('I liked this phone.', 5.,  True),
       ('I liked this phone.', 5.,  True),
       ('I liked this phone.', 5.,  True),
       ('I liked this phone.', 5.,  True),
       ('I liked this phone.', 5.,  True)],
      dtype=[('text', '<U30'), ('rating', '<f8'), ('positive', '?')])
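Since the question is about class instances rather than dictionaries, the same construction can be sketched with attribute access. The Sentence class and the sample data below are hypothetical stand-ins; the only assumption is that each object exposes attributes named like the dtype fields:

```python
import numpy as np

class Sentence:  # hypothetical stand-in for the asker's class
    def __init__(self, text, rating, positive):
        self.text = text
        self.rating = rating
        self.positive = positive

sentences = [
    Sentence("I liked this phone.", 5.0, True),
    Sentence("Battery died in a day.", 1.0, False),
    Sentence("Decent for the price.", 3.5, True),
]

dt = np.dtype([('text', 'U30'), ('rating', float), ('positive', bool)])

# One tuple per object, with attributes read in the same order as the dtype fields.
records = [tuple(getattr(s, name) for name in dt.names) for s in sentences]
arr = np.array(records, dtype=dt)
```

Note that 'U30' silently truncates text longer than 30 characters, so the width should be chosen to fit the longest sentence.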

Access fields by name. Note that this is a 1-D array of shape (5,) with 3 fields, not a (5,3) array:

In [60]: arrS['rating']
Out[60]: array([5., 5., 5., 5., 5.])
In [61]: arrS['positive']
Out[61]: array([ True,  True,  True,  True,  True])

Modify the values of a field:

In [62]: arrS['positive'] = [1,0,0,1,0]
In [63]: arrS['rating'] = np.arange(5)
In [64]: arrS
Out[64]: 
array([('I liked this phone.', 0.,  True),
       ('I liked this phone.', 1., False),
       ('I liked this phone.', 2., False),
       ('I liked this phone.', 3.,  True),
       ('I liked this phone.', 4., False)],
      dtype=[('text', '<U30'), ('rating', '<f8'), ('positive', '?')])

We can do math on the numeric fields:

In [65]: np.sum(arrS['rating'])
Out[65]: 10.0

Use the boolean field as a mask:

In [66]: arrS[arrS['positive']]
Out[66]: 
array([('I liked this phone.', 0.,  True),
       ('I liked this phone.', 3.,  True)],
      dtype=[('text', '<U30'), ('rating', '<f8'), ('positive', '?')])
In [67]: arrS[~arrS['positive']]
Out[67]: 
array([('I liked this phone.', 1., False),
       ('I liked this phone.', 2., False),
       ('I liked this phone.', 4., False)],
      dtype=[('text', '<U30'), ('rating', '<f8'), ('positive', '?')])
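The mask and the field access can also be combined. As a small sketch, rebuilding the same arrS so that the snippet runs on its own (the variable names mask, mean_positive, mean_negative and n_positive are mine):

```python
import numpy as np

dt = np.dtype([('text', 'U30'), ('rating', float), ('positive', bool)])
arrS = np.array([("I liked this phone.", 5.0, True)] * 5, dtype=dt)
arrS['positive'] = [1, 0, 0, 1, 0]
arrS['rating'] = np.arange(5)

mask = arrS['positive']

# Select one numeric field, then filter it with the boolean field.
mean_positive = arrS['rating'][mask].mean()
mean_negative = arrS['rating'][~mask].mean()
n_positive = np.count_nonzero(mask)
```

Selecting the field first keeps the computation on a plain float array, with no Python loop over the records.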

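Operations on a structured array are faster than on an object dtype array, but slower than comparable operations on separate arrays or on an all-numeric array.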
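For comparison, the "separate arrays" layout might look like the sketch below. The Sentence class and the data are hypothetical; the idea is simply one plain array per numeric attribute, with the text left in an ordinary list:

```python
import numpy as np

class Sentence:  # hypothetical stand-in for the asker's class
    def __init__(self, text, rating, positive):
        self.text = text
        self.rating = rating
        self.positive = positive

sentences = [
    Sentence("I liked this phone.", 5.0, True),
    Sentence("Battery died in a day.", 1.0, False),
    Sentence("Decent for the price.", 3.5, True),
]

# One pass over the objects per attribute; afterwards the numeric work is vectorized.
texts = [s.text for s in sentences]
ratings = np.array([s.rating for s in sentences], dtype=float)
positive = np.array([s.positive for s in sentences], dtype=bool)

mean_positive_rating = ratings[positive].mean()
positive_texts = [t for t, keep in zip(texts, positive) if keep]
```

The three containers stay aligned by index, so position i in each one refers to the same sentence.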

Answer 2

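In my class I created an as_dict() method that returns the object as a dictionary. From there, I loaded the dictionary version of each object into a pandas DataFrame and then called as_matrix() to get it as a numpy array. Seems to do the trick!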
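A rough sketch of that approach, with a hypothetical Sentence class and made-up data. One deliberate substitution: as_matrix() was deprecated in pandas 0.23 and removed in pandas 1.0, so the sketch calls DataFrame.to_numpy() instead:

```python
import pandas as pd

class Sentence:  # hypothetical stand-in for the asker's class
    def __init__(self, text, rating, positive):
        self.text = text
        self.rating = rating
        self.positive = positive

    def as_dict(self):
        # Key order here becomes the column order of the DataFrame.
        return {'text': self.text, 'rating': self.rating, 'positive': self.positive}

sentences = [
    Sentence("I liked this phone.", 5.0, True),
    Sentence("Battery died in a day.", 1.0, False),
]

df = pd.DataFrame([s.as_dict() for s in sentences])

# Mixed column types collapse to a single object dtype 2-D array.
arr = df.to_numpy()

# A single numeric column keeps its own dtype.
ratings = df['rating'].to_numpy()
```

Because the columns have different types, the 2-D result has object dtype; taking one column at a time gives numeric arrays that vectorize properly.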
