Question

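I have a pandas dataframe with a mix of datatypes (dtypes) that I want to convert to a numpy structured array (or record array, which is basically the same thing in this case). For a purely numeric dataframe, this is easy to do with the to_records() method. I also need the dtypes of the pandas columns to be converted to strings rather than objects, so that I can use the numpy method tofile(), which will write numbers and strings to a binary file but will not write objects.

In a nutshell, I need to convert pandas columns with dtype=object to numpy structured arrays of string or unicode dtype.

Here is an example. If all columns had a numeric (float or int) dtype, this code would be sufficient.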

import numpy as np
import pandas as pd

df=pd.DataFrame({'f_num': [1.,2.,3.], 'i_num':[1,2,3], 
                 'char': ['a','bb','ccc'], 'mixed':['a','bb',1]})

struct_arr=df.to_records(index=False)

print('struct_arr',struct_arr.dtype,'\n')

# struct_arr (numpy.record, [('f_num', '<f8'), ('i_num', '<i8'), 
#                            ('char', 'O'), ('mixed', 'O')])

However, because I want to end up with string dtypes, I need to add this additional and somewhat involved code:

lst=[]
for col in struct_arr.dtype.names:  # this was the only iterator I 
                                    # could find for the column labels
    dt=struct_arr[col].dtype

    if dt == 'O':   # this is 'O', meaning 'object'

        # it appears an explicit string length is required
        # so I calculate with pandas len & max methods
        dt = 'U' + str( df[col].astype(str).str.len().max() )

    lst.append((col,dt))

struct_arr = struct_arr.astype(lst)

print('struct_arr',struct_arr.dtype)

# struct_arr (numpy.record, [('f_num', '<f8'), ('i_num', '<i8'), 
#                            ('char', '<U3'), ('mixed', '<U2')])

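See also: How to change the dtype of certain columns of a numpy recarray?

This seems to work, as the char and mixed dtypes are now <U3 and <U2 rather than 'O' or 'object'. I'm just checking whether there is a simpler or more elegant approach. But since pandas does not have a native string type as numpy does, maybe there is not?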

Answer 1

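As far as I know, there is no native functionality for this. For example, the maximum length of all values within a series is not stored anywhere.

However, you can implement your logic more compactly with a list comprehension and f-strings (here arr is the record array returned by to_records, called struct_arr in the question):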
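To illustrate the point about string lengths, here is a minimal sketch. It assumes an object-dtype Series built explicitly for the demonstration (the variable names are made up for this example). The Series carries no width information, so the width has to be computed by scanning the values, which is what the question's code does.

```python
import numpy as np
import pandas as pd

# An object-dtype Series: each element is an arbitrary Python object,
# so the dtype itself says nothing about string length.
s = pd.Series(['a', 'bb', 'ccc'], dtype=object)
series_dtype = s.dtype

# The width has to be computed by looking at every value.
width = int(s.astype(str).str.len().max())

# numpy's fixed-width unicode dtype stores that width in the dtype.
fixed = np.array(s.tolist(), dtype=f'U{width}')
print(series_dtype, width, fixed.dtype)
```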

data_types = [(col, arr[col].dtype if arr[col].dtype != 'O' else \
               f'U{df[col].astype(str).str.len().max()}') for col in arr.dtype.names]
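For completeness, the list comprehension can be put together with the question's sample data as follows. This is only a sketch: it assumes arr comes from df.to_records(index=False) and that the final astype call from the question is applied unchanged.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'f_num': [1., 2., 3.], 'i_num': [1, 2, 3],
                   'char': ['a', 'bb', 'ccc'], 'mixed': ['a', 'bb', 1]})

# Record array with object ('O') fields for the two non-numeric columns.
arr = df.to_records(index=False)

# Keep numeric dtypes; replace 'O' with a unicode dtype wide enough
# for the longest string representation in that column.
data_types = [
    (col, arr[col].dtype if arr[col].dtype != 'O'
     else f'U{df[col].astype(str).str.len().max()}')
    for col in arr.dtype.names
]

# Cast field by field to the new structured dtype.
arr = arr.astype(data_types)
print(arr.dtype)
```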

Answer 2

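Combining suggestions from @jpp (a list comprehension for conciseness) and @hpaulj (cannibalizing to_records for speed), I came up with the following, which is cleaner code and also about 5x faster than my original code (tested by expanding the sample dataframe above to 10,000 rows):

names = list(df.columns)
# Series.get_values() was removed in pandas 1.0; to_numpy() replaces it
arrays = [ df[col].to_numpy() for col in names ]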

formats = [ array.dtype if array.dtype != 'O' 
            else f'{array.astype(str).dtype}' for array in arrays ] 

rec_array = np.rec.fromarrays( arrays, dtype={'names': names, 'formats': formats} )

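The code above outputs unicode rather than byte strings, which is probably better in general. In my case, though, I need byte strings because I am reading the binary file in Fortran, and byte strings seem to be easier to read in. So it may be better to replace the "formats" line above with this: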

formats = [ array.dtype if array.dtype != 'O' 
            else array.astype(str).dtype.str.replace('<U','S') for array in arrays ]

For example, a dtype of <U4 becomes S4.
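As a rough end-to-end check, the sketch below wraps the steps above in a helper, writes the record array with tofile() and reads it back with np.fromfile(). The helper name df_to_recarray, its string_kind parameter and the file name records.bin are all made up for this example. It also converts the object columns explicitly before calling np.rec.fromarrays, which is a small variation on the code above rather than the exact same call.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def df_to_recarray(df, string_kind='S'):
    """Hypothetical helper: object columns become fixed-width
    'S' (bytes) or 'U' (unicode) fields; other columns are kept."""
    names = list(df.columns)
    arrays = []
    for col in names:
        array = df[col].to_numpy()
        if array.dtype == 'O':
            as_str = array.astype(str)            # e.g. <U3
            width = as_str.dtype.itemsize // 4    # 'U' uses 4 bytes per char
            array = as_str.astype(f'{string_kind}{width}')
        arrays.append(array)
    formats = [array.dtype for array in arrays]
    return np.rec.fromarrays(arrays, dtype={'names': names, 'formats': formats})


df = pd.DataFrame({'f_num': [1., 2., 3.], 'i_num': [1, 2, 3],
                   'char': ['a', 'bb', 'ccc'], 'mixed': ['a', 'bb', 1]})
rec = df_to_recarray(df)

# Round trip through a raw binary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'records.bin')
    rec.tofile(path)
    n_bytes = os.path.getsize(path)
    back = np.fromfile(path, dtype=rec.dtype)

print(rec.dtype, n_bytes)
```

The file contains only the packed records with no header, so the reader (Fortran in my case) has to know the field layout, which is given by rec.dtype.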

Convert a dataframe to a rec array (and objects to strings)
