Question

I want to create a numpy array from a list of lists. The data types should be float, float, string. Why doesn't this work? (Note: I have already read this question.)

import numpy

print numpy.array([[u'1.2', u'1.3', u'hello'], [u'1.4', u'1.5', u'hi']], dtype='f,f,str')

Output:

[[(4.2245014868923476e-39, 7.006492321624085e-44, '')
  (4.2245014868923476e-39, 7.146622168056567e-44, '')
  (9.275530846997402e-39, 9.918384925297198e-39, '')]
 [(4.2245014868923476e-39, 7.286752014489049e-44, '')
  (4.2245014868923476e-39, 7.42688186092153e-44, '')
  (9.642872831629367e-39, 0.0, '')]]

Answer 1

As my previous answer and comments stressed, the normal input for a compound dtype is a list of tuples. Put plainly, that is just how np.array works.

In [308]: numpy.array([[u'1.2', u'1.3', u'hello'], [u'1.4', u'1.5', u'hi']], dtype='f,f,str')
TypeError: a bytes-like object is required, not 'str'

With a list of tuples and an improved dtype:

In [311]: numpy.array([(u'1.2', u'1.3', u'hello'), (u'1.4', u'1.5', u'hi')], dtype='f8,f8,U10')
Out[311]: 
array([( 1.2,  1.3, 'hello'), ( 1.4,  1.5, 'hi')],
      dtype=[('f0', '<f8'), ('f1', '<f8'), ('f2', '<U10')])
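
As a minimal script-form sketch of the same idea (outside an IPython session), the following builds the structured array from a list of tuples and reads it back by field. The variable names `rows`, `first_col` and `record` are illustrative only; the field names f0, f1, f2 are the defaults NumPy assigns for a comma-separated dtype string.

```python
import numpy as np

# Compound dtype: two float64 fields and one unicode field of up to 10 characters.
dt = np.dtype('f8,f8,U10')

# One tuple per record; each tuple has one entry per field.
rows = [('1.2', '1.3', 'hello'), ('1.4', '1.5', 'hi')]
arr = np.array(rows, dtype=dt)

first_col = arr['f0']   # 1-D float64 array holding the first field of every record
record = arr[1]         # a single record, indexable by field name

print(arr.shape)        # one dimension: the tuples are records, not a second axis
print(first_col)
print(record['f2'])
```

Note that the result is one-dimensional: each tuple becomes a single record rather than a row of a 2-D array.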

A possible alternative to the usual list of tuples (I can't test it right now):

Make a zeros array of the right shape and dtype
Make an object array from the list of lists (or a 2d array of strings)
Assign columns of the 2d array to fields of the structured array (in a loop)

Looping over a few fields is usually faster than looping over many records.

That said, converting a list of lists to a list of tuples shouldn't be that expensive.

In [314]: alist = [[u'1.2', u'1.3', u'hello'], [u'1.4', u'1.5', u'hi']]
In [316]: dt = np.dtype('f8,f8,U10')

Creating the array from a list of tuples:

In [317]: np.array([tuple(a) for a in alist], dtype=dt)
Out[317]: 
array([( 1.2,  1.3, 'hello'), ( 1.4,  1.5, 'hi')],
      dtype=[('f0', '<f8'), ('f1', '<f8'), ('f2', '<U10')])

Setting it field by field:

In [319]: res = np.zeros(len(alist), dtype=dt)
In [320]: temp = np.array(alist)    
In [321]: temp                    # default string dtype
Out[321]: 
array([['1.2', '1.3', 'hello'],
       ['1.4', '1.5', 'hi']],
      dtype='<U5')
In [322]: for i,n in enumerate(dt.names):
     ...:     res[n] = temp[:,i]
     ...:     
In [323]: res
Out[323]: 
array([( 1.2,  1.3, 'hello'), ( 1.4,  1.5, 'hi')],
      dtype=[('f0', '<f8'), ('f1', '<f8'), ('f2', '<U10')])
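
The field-by-field assignment can be wrapped in a small function. This is only a sketch: `lists_to_structured` is a hypothetical helper name, and the field names x, y and label are chosen for illustration rather than taken from the answer.

```python
import numpy as np

def lists_to_structured(rows, dtype):
    """Hypothetical helper: fill a structured array one field at a time."""
    dtype = np.dtype(dtype)
    out = np.zeros(len(rows), dtype=dtype)   # zero-filled records of the target dtype
    as_text = np.array(rows)                 # 2-D array of strings, one row per record
    for col, name in enumerate(dtype.names):
        # Assigning a string column to a field casts it to that field's type.
        out[name] = as_text[:, col]
    return out

# Assumed example dtype with explicit field names instead of the default f0, f1, f2.
dt = np.dtype([('x', 'f8'), ('y', 'f8'), ('label', 'U10')])
res = lists_to_structured([['1.2', '1.3', 'hello'], ['1.4', '1.5', 'hi']], dt)
print(res)
```

The loop runs once per field, not once per record, so its Python-level overhead does not grow with the number of rows.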

For this small case the list-of-tuples approach is faster. The field approach might win on larger input, but that has to be tested:

In [325]: timeit np.array([tuple(a) for a in alist], dtype=dt)
6.26 µs ± 6.28 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [326]: %%timeit
     ...: res = np.zeros(len(alist), dtype=dt)
     ...: temp = np.array(alist)
     ...: for i,n in enumerate(dt.names):
     ...:     res[n] = temp[:,i]
     ...: 
18.2 µs ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

But even with many rows, the tuple conversion is faster:

In [334]: arr = np.random.randint(0,100,(100000,3)).astype('U10')
In [335]: alist = arr.tolist()
In [336]: timeit np.array([tuple(a) for a in alist], dtype=dt)
93.5 ms ± 322 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [337]: %%timeit
     ...: res = np.zeros(len(alist), dtype=dt)
     ...: temp = np.array(alist)
     ...: for i,n in enumerate(dt.names):
     ...:     res[n] = temp[:,i]
     ...: 
124 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pulling the tuple comprehension out of the timed loop saves some time:

In [341]: %%timeit temp = [tuple(a) for a in alist]
     ...: np.array(temp, dtype=dt)
     ...: 
65.4 ms ± 98.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pulling the string array creation out of the timing:

In [342]: %%timeit temp = np.array(alist)
     ...: res = np.zeros(len(alist), dtype=dt)
     ...: for i,n in enumerate(dt.names):
     ...:     res[n] = temp[:,i]
     ...: 
71 ms ± 447 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Just creating the string array from the list is more expensive than the tuple conversion.
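
To repeat the comparison outside IPython, something like the following sketch with the standard `timeit` module should work. The function names `via_tuples` and `via_fields`, the row count and the repeat count are assumptions made for illustration; timings depend on the machine and the NumPy version, so no expected numbers are given.

```python
import timeit
import numpy as np

dt = np.dtype('f8,f8,U10')

# Smaller test data than in the session above, so the script finishes quickly.
rng = np.random.default_rng(0)
alist = rng.integers(0, 100, (1000, 3)).astype('U10').tolist()

def via_tuples(rows):
    # Convert each inner list to a tuple, then let np.array build the records.
    return np.array([tuple(r) for r in rows], dtype=dt)

def via_fields(rows):
    # Build a 2-D string array, then copy each column into its field.
    res = np.zeros(len(rows), dtype=dt)
    temp = np.array(rows)
    for i, name in enumerate(dt.names):
        res[name] = temp[:, i]
    return res

t_tuples = timeit.timeit(lambda: via_tuples(alist), number=20)
t_fields = timeit.timeit(lambda: via_fields(alist), number=20)
print(f"tuples: {t_tuples:.4f} s, fields: {t_fields:.4f} s")
```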

Answer 2

As I explained in this post, it works with dtype='object':

print(numpy.array([[u'1.2', u'1.3', u'hello'], [u'1.4', u'1.5', u'hi']], dtype='object'))

(Works with Python 3.7.1)
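
One caveat worth illustrating: with dtype='object' the elements stay Python strings, so the numeric columns are not converted to floats automatically. A minimal sketch of converting them explicitly (the names `data`, `numbers` and `labels` are illustrative):

```python
import numpy as np

data = [['1.2', '1.3', 'hello'], ['1.4', '1.5', 'hi']]

# A 2-D object array; every element is still the original Python str.
obj = np.array(data, dtype=object)
print(obj.shape, type(obj[0, 0]))

# Convert the first two columns to float64 explicitly; keep the last as labels.
numbers = obj[:, :2].astype(float)
labels = obj[:, 2]
print(numbers)
print(labels)
```

This keeps the list-of-lists input unchanged, at the cost of holding the data in two separate arrays instead of one structured array.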
