Question

My question stems from this answer by Phil. The code is:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1,31,2.5,1260759144], [1,1029,3,1260759179],
                    [1,1061,3,1260759182],[1,1129,2,1260759185],
                    [1,1172,4,1260759205],[2,31,3,1260759134],
                    [2,1111,4.5,1260759256]],
                   index=list(['a','c','h','g','e','b','f',]),  
                   columns=list( ['userId','movieId','rating','timestamp']) )
df.index.names=['ID No.']
df.columns.names=['Information']

def df_to_sarray(df):
    """
    Convert a pandas DataFrame object to a numpy structured array.
    This is functionally equivalent to but more efficient than
    np.array(df.to_array())

    :param df: the data frame to convert
    :return: a numpy structured array representation of df
    """
    v = df.values
    cols = df.columns
    # df[k].dtype.type is <class 'numpy.object_'>; I want to convert it to numpy.str
    types = [(cols[i], df[k].dtype.type) for (i, k) in enumerate(cols)]
    dtype = np.dtype(types)
    z = np.zeros(v.shape[0], dtype)
    for (i, k) in enumerate(z.dtype.names):
        z[k] = v[:, i]
    return z
sa = df_to_sarray(df.reset_index())
print(sa)

Phil's answer works well, but if I run

sa = df_to_sarray(df.reset_index())

I get the following result:

array([('a', 1, 31, 2.5, 1260759144), ('c', 1, 1029, 3.0, 1260759179),
       ('h', 1, 1061, 3.0, 1260759182), ('g', 1, 1129, 2.0, 1260759185),
       ('e', 1, 1172, 4.0, 1260759205), ('b', 2, 31, 3.0, 1260759134),
       ('f', 2, 1111, 4.5, 1260759256)], 
      dtype=[('ID No.', 'O'), ('userId', '<i8'), ('movieId', '<i8'), ('rating', '<f8'), ('timestamp', '<i8')])

I would like to get the following dtype instead:

dtype=[('ID No.', 'S'), ('userId', '<i8'), ('movieId', '<i8'), ('rating', '<f8'), ('timestamp', '<i8')]

That is, a string rather than an object.

I checked df[k].dtype.type for that column and found that it is <class 'numpy.object_'>. I want to convert it to numpy.str. How can I do that?

Answer 1

After reset_index, the dataframe's dtypes are a mix of object and numeric. The index has been rendered as object, not string.

In [9]: df1=df.reset_index()
In [10]: df1.dtypes
Out[10]: 
Information
ID No.        object
userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

df1.values is a (7,5) object-dtype array.

With the correct dtype, your approach works fine (I'm using 'U2' on Py3):

In [31]: v = df1.values
In [32]: dt1=np.dtype([('ID No.', 'U2'), ('userId', '<i8'), ('movieId', '<i8'), 
    ...: ('rating', '<f8'), ('timestamp', '<i8')])
In [33]: z = np.zeros(v.shape[0], dtype=dt1)
In [34]: 
In [34]: for i,k in enumerate(dt1.names):
    ...:     z[k] = v[:, i]
    ...:     
In [35]: z
Out[35]: 
array([('a', 1,   31,  2.5, 1260759144), ('c', 1, 1029,  3. , 1260759179),
       ('h', 1, 1061,  3. , 1260759182), ('g', 1, 1129,  2. , 1260759185),
       ('e', 1, 1172,  4. , 1260759205), ('b', 2,   31,  3. , 1260759134),
       ('f', 2, 1111,  4.5, 1260759256)], 
      dtype=[('ID No.', '<U2'), ('userId', '<i8'), ('movieId', '<i8'), ('rating', '<f8'), ('timestamp', '<i8')])

So the trick is to derive dt1 from the dataframe.
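One caveat about a fixed-width string dtype such as 'U2': as far as I can tell, NumPy silently truncates longer strings when they are assigned into such a field. The sketch below uses made-up ID values (not the question's data) to show the truncation, and shows that astype(str) picks a width from the longest string instead.

```python
import numpy as np

# Object-dtype column holding Python strings of different lengths
# (illustrative values, not taken from the question's dataframe).
ids = np.array(['a', 'bc', 'hello'], dtype=object)

# A fixed-width 'U2' field keeps at most two characters per entry.
z = np.zeros(ids.shape[0], dtype=[('ID No.', 'U2')])
z['ID No.'] = ids          # 'hello' is cut down to 'he' without any warning

# Letting NumPy choose the width avoids the truncation.
auto = ids.astype(str)     # width is taken from the longest string

print(z['ID No.'])
print(auto.dtype)
```

For the single-character IDs in the question 'U2' is wide enough, but for real data the width should be checked first.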

Editing types after constructing it is one option:

In [36]: cols=df1.columns
In [37]: types = [(cols[i], df1[k].dtype.type) for (i, k) in enumerate(cols)]
In [38]: types
Out[38]: 
[('ID No.', numpy.object_),
 ('userId', numpy.int64),
 ('movieId', numpy.int64),
 ('rating', numpy.float64),
 ('timestamp', numpy.int64)]
In [39]: types[0]=(types[0][0], 'U2')
In [40]: types
Out[40]: 
[('ID No.', 'U2'),
 ('userId', numpy.int64),
 ('movieId', numpy.int64),
 ('rating', numpy.float64),
 ('timestamp', numpy.int64)]
In [41]: 
In [41]: z = np.zeros(v.shape[0], dtype=types)

Adjusting the column dtype during construction also works:

def foo(atype):
    if atype==np.object_:
        return 'U2'
    return atype
In [59]: types = [(cols[i], foo(df1[k].dtype.type)) for (i, k) in enumerate(cols)]

In either case, we have to know ahead of time that we want to convert the object column to a specific string type, rather than something more generic.
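If hard-coding the string width is undesirable, one possible workaround is to let NumPy measure it. The following is only a sketch: df_to_sarray_auto is a hypothetical name, and it assumes every object column should become a unicode string column.

```python
import numpy as np
import pandas as pd


def df_to_sarray_auto(df):
    """Hypothetical variant of df_to_sarray.

    Object columns are converted to fixed-width unicode ('U<n>'),
    where n is the length of the longest string in that column.
    """
    arrays = {}
    for name in df.columns:
        arr = df[name].to_numpy()
        if arr.dtype == object:
            # astype(str) sizes the field from the longest entry
            arr = arr.astype(str)
        arrays[name] = arr

    dtype = np.dtype([(name, arr.dtype) for name, arr in arrays.items()])
    out = np.zeros(len(df), dtype=dtype)
    for name, arr in arrays.items():
        out[name] = arr
    return out


# Small made-up frame, shaped like the one in the question
df = pd.DataFrame({'userId': [1, 2], 'rating': [2.5, 3.0]},
                  index=['a', 'long_id'])
df.index.name = 'ID No.'

sa = df_to_sarray_auto(df.reset_index())
print(sa.dtype)
```

Note that astype(str) stringifies whatever is in the column, so missing values such as None or NaN would end up as the text 'None' or 'nan'.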

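I don't know pandas well enough to say whether the dtype of the ID column can be changed before we extract the array. Because of the mix of column dtypes, .values will be object dtype.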
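On the pandas side, DataFrame.to_records may be worth a look: to my knowledge it accepts a column_dtypes argument (pandas 0.24 and later) that maps column names to the dtype wanted in the result. A minimal sketch on a made-up frame, assuming 'U2' is wide enough for the IDs:

```python
import pandas as pd

# Small made-up frame with a named string index, as in the question
df = pd.DataFrame({'userId': [1, 2], 'rating': [2.5, 3.0]},
                  index=pd.Index(['a', 'b'], name='ID No.'))

# column_dtypes overrides the dtype of the listed columns;
# columns not listed keep the dtype pandas already has for them.
rec = df.reset_index().to_records(index=False,
                                  column_dtypes={'ID No.': 'U2'})

print(rec.dtype)
```

The result is a NumPy record array rather than a plain structured array, but its fields can be accessed by name in the same way (rec['ID No.']).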
