Question

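I'm trying to convert the 'feature1' arrays from the following data structure into a numpy array so that I can feed it to sklearn. However, I'm going round in circles: it keeps telling me that dtype=object is unsuitable, and I can't convert it to the float64 format I need.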

I want to extract every 'feature1' entry from the structure below as a list of numpy arrays with dtype=float64 rather than dtype=object.

vec is an object returned from an earlier computation.

>>> vec
[{'is_Primary': 1, 'feature1': [2, 2, 2, 0, 0.03333333333333333, 0], 'object_id': ObjectId('557beda51d41c8e4d1aeac25'), 'vectorized': 1},
{'is_Primary': 0, 'feature1': [2, 2, 1, 0, 0.5, 0], 'object_id': ObjectId('557beda51d41c8e4d1aeac25'), 'vectorized': 1}]

I tried the following:

>>> t = np.array(list(vec))
>>> t
array([{'is_Primary': 0, 'feature1': [], 'object_id': ObjectId('557bcd881d41c8d9c5f5822f'), 'vectorized': 1},
   {'is_Primary': 0, 'feature1': [], 'object_id': ObjectId('557bcd881d41c8d9c5f58233'), 'vectorized': 1},
   {'is_Primary': 0, 'feature1': [], 'object_id': ObjectId('557bcd881d41c8d9c5f58237'), 'vectorized': 1},
   ...,
   {'is_Primary': 0, 'feature1': [], 'object_id': ObjectId('557beda61d41c8e4d1aead1f'), 'vectorized': 1},
   {'is_Primary': 1, 'feature1': [2, 2, 0, 0], 'object_id': ObjectId('557beda61d41c8e4d1aead1d'), 'vectorized': 1},
   {'is_Primary': 1, 'feature1': [], 'object_id': ObjectId('557beda61d41c8e4d1aead27'), 'vectorized': 1}], dtype=object)

Also,

>>> array = np.array([x['feature1'] for x in vec])

as suggested by other users, gives similar output:

>>> array
array([[], [], [], ..., [], [2, 2, 0, 0], []], dtype=object)

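I know that I can access the contents of 'feature1' with array[i], but what I want is to convert dtype=object to dtype=float64 and build a list/dict in which each row holds the 'feature1' entries of the corresponding item in vec.

I also tried using a pandas DataFrame, but to no avail.

>>> pandaseries = pd.Series(df['feature1']).convert_objects(convert_numeric=True)
>>> pandaseries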
0     []
1     []
2     []
3     []
4     []
5     []
6     []
7     []
8     []
9     []
10    []
11    []
12    []
13    []
14    []
...
7021                                                   []
7022    [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 12, 2, 24...
7023                                                   []
7024                                                   []
7025                                                   []
7026                                                   []
7027                                                   []
7028    [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 12, 2, 24...
7029                                                   []
7030    [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 12, 2, 24...
7031                                                   []
7032                                       [2, 2, 0.1, 0]
7033                                                   []
7034                                         [2, 2, 0, 0]
7035                                                   []
Name: feature1, Length: 7036, dtype: object
    >>>

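Again, dtype: object is returned. My guess would be to loop over each row and print out a list, but I could not get that to work. Maybe this is a newbie question. What am I doing wrong?

Thanks.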

Answer 1

This:

array = numpy.array([x['feature1'] for x in vec])

Otherwise you need to be clearer in your example...

Answer 2

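You can access the value of a dictionary item using its key:

d = {'a': 1}
d['a']  # --> 1

To access the items in a list, you can iterate over it or use an index:

a = [1, 2]

for thing in a:
    pass  # do something with thing

a[0]  # --> 1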

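map is a convenient way to apply a function to every item of an iterable. It returns a list of the results in Python 2 and a lazy iterator in Python 3, so the call is wrapped in list() here to work on both. operator.itemgetter returns a function that retrieves an item from an object.

import operator
import numpy as np
feature1 = operator.itemgetter('feature1')
a = np.asarray(list(map(feature1, vec)))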

vec = [{'is_Primary': 1, 'feature1': [2, 2, 2, 0, 0.03333333333333333, 0], 'object_id': ObjectId('557beda51d41c8e4d1aeac25'), 'vectorized': 1},
       {'is_Primary': 0, 'feature1': [2, 2, 1, 0, 0.5, 0], 'object_id': ObjectId('557beda51d41c8e4d1aeac25'), 'vectorized': 1}]

>>> a = np.asanyarray(list(map(feature1, vec)))
>>> a.shape
(2, 6)
>>> print a
[[ 2.          2.          2.          0.          0.03333333  0.        ]
 [ 2.          2.          1.          0.          0.5         0.        ]]
>>> 
>>> for thing in a[1,:]:
    print type(thing)

<type 'numpy.float64'>
<type 'numpy.float64'>
<type 'numpy.float64'>
<type 'numpy.float64'>
<type 'numpy.float64'>
<type 'numpy.float64'>
>>>
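
The session above works because both 'feature1' lists have the same length. When the lengths differ, as they do in the question's full data set, one option is to convert each row separately, which gives the "list of numpy arrays with dtype=float64" the question asks for. The following is a minimal Python 3 sketch under assumptions: the sample data mirrors the question's vec with the 'object_id' field left out, and feature_arrays is a hypothetical helper name.

```python
import numpy as np

# Sample data shaped like the question's vec ('object_id' omitted for brevity).
vec = [
    {'is_Primary': 1, 'feature1': [2, 2, 2, 0, 0.03333333333333333, 0], 'vectorized': 1},
    {'is_Primary': 0, 'feature1': [2, 2, 1, 0, 0.5, 0], 'vectorized': 1},
    {'is_Primary': 0, 'feature1': [], 'vectorized': 1},
]


def feature_arrays(records, key='feature1'):
    """Return one float64 array per record (hypothetical helper)."""
    return [np.asarray(rec[key], dtype=np.float64) for rec in records]


arrays = feature_arrays(vec)
for arr in arrays:
    # Every row is float64, including the empty one, but row lengths still differ.
    print(arr.dtype, arr.shape)
```

A list like this is still not a single 2-D matrix, so it would need padding or filtering before being passed to an estimator that expects a rectangular input.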

Answer 3

Let's take as the starting point a list of lists, or equivalently an object array of lists:

A = [[], [], [], [1,2,1], [], [2, 2, 0, 0], []]
A = np.array([[], [], [], [1,2,1], [], [2, 2, 0, 0], []], dtype=object)

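If the sublists were all the same length, np.array([...]) would give you a 2-D array, with one row per sublist and as many columns as their common length. But since they differ in length, it can only make a 1-D array in which each element is a pointer to one of the sublists, i.e. dtype=object.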
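
A small sketch of that difference, using made-up sample lists. One assumption about versions: recent NumPy releases (1.24 and later) raise a ValueError for ragged input unless dtype=object is passed explicitly, so the sketch passes it.

```python
import numpy as np

equal = [[2, 2, 1, 0, 0.5, 0], [2, 2, 2, 0, 0.03, 0]]
ragged = [[], [1, 2, 1], [2, 2, 0, 0]]

# Equal-length sublists become a regular 2-D float array.
a = np.array(equal)
print(a.dtype, a.shape)

# Ragged sublists can only be stored as a 1-D array of Python list objects.
b = np.array(ragged, dtype=object)
print(b.dtype, b.shape)

# Casting that object array to float64 fails, because each element is a list.
try:
    b.astype(np.float64)
    cast_failed = False
except (ValueError, TypeError) as exc:
    cast_failed = True
    print('cannot convert:', exc)
```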

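I can imagine two ways of constructing a 2-D array:

Pad each sublist to a common length.
Insert each sublist into an empty array of the appropriate size.

Either way it requires ordinary Python iteration; this is not a common enough task to have a dedicated function.

For example, the second approach: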

In [346]: n=len(A)
In [348]: m=max([len(x) for x in A])
In [349]: AA=np.zeros((n,m),int)
In [350]: for i,x in enumerate(A):
   .....:     AA[i,:len(x)] = x
In [351]: AA
Out[351]: 
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [1, 2, 1, 0],
       [0, 0, 0, 0],
       [2, 2, 0, 0],
       [0, 0, 0, 0]])
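
The first approach listed above, padding each sublist to a common length, could look like the sketch below. It assumes that zero is an acceptable fill value for the missing entries (the variable fill is introduced here for illustration), and it uses float64 since that is the dtype the question asks for.

```python
import numpy as np

A = [[], [], [], [1, 2, 1], [], [2, 2, 0, 0], []]

m = max(len(x) for x in A)  # length of the longest sublist
fill = 0.0                  # assumption: zero is a sensible padding value

# Extend every sublist with the fill value until it has length m.
padded = [list(x) + [fill] * (m - len(x)) for x in A]

AA = np.array(padded, dtype=np.float64)
print(AA.shape, AA.dtype)
```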

To get a sparse matrix:

In [352]: from scipy import sparse
In [353]: MA=sparse.coo_matrix(AA)
In [354]: MA
Out[354]: 
<7x4 sparse matrix of type '<class 'numpy.int32'>'
    with 5 stored elements in COOrdinate format>

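There is nothing magical here, just straightforward sparse matrix construction. I suppose you could bypass the dense matrix.

There is a list-of-lists sparse format (lil) that looks a bit like your data.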

In [356]: Ml=MA.tolil()

In [357]: Ml.rows
Out[357]: array([[], [], [], [0, 1, 2], [], [0, 1], []], dtype=object)

In [358]: Ml.data
Out[358]: array([[], [], [], [1, 2, 1], [], [2, 2], []], dtype=object)

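Conceivably you could construct an empty sparse.lil_matrix((n,m)) matrix and set its .data attribute directly. But you would also have to compute the rows attribute.

You could also look at the data, row and col attributes of the coo format matrix, and decide that it would be easy to construct the equivalent from your A list of lists.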
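
As a sketch of that last idea, the coordinate triples can be collected from A with plain loops and handed to sparse.coo_matrix, skipping the dense intermediate. This assumes that explicit zeros inside a sublist should not be stored, which matches the 5 stored elements reported earlier.

```python
import numpy as np
from scipy import sparse

A = [[], [], [], [1, 2, 1], [], [2, 2, 0, 0], []]

n = len(A)
m = max(len(x) for x in A)

# Collect (row, col, value) for each nonzero entry of each sublist.
row, col, data = [], [], []
for i, x in enumerate(A):
    for j, v in enumerate(x):
        if v != 0:
            row.append(i)
            col.append(j)
            data.append(v)

MA = sparse.coo_matrix((data, (row, col)), shape=(n, m), dtype=np.float64)
print(MA.shape, MA.nnz)
```

Calling MA.tocsr() afterwards gives a format that scikit-learn estimators with sparse support generally accept.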

Either way, you have to decide how the nonzero rows get padded to the full length.

Converting a data structure with dtype=object to dtype=float64
