To convert a pandas DataFrame to a plain numpy array, I usually use the following convenience function:
def df2numpy(df):
    df.index.name = "i"
    valDf = df.values
    indDf = df.index
    colsDf = df.columns
    colDicDf = {}
    for runner in range(len(df.columns)):
        colDicDf[df.columns[runner]] = runner
    return valDf, indDf, colDicDf
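This gives me valDf, indDf, and colDicDf, a dictionary that I can query as colDicDf["column_name"] to get the index of the column I am interested in. What would this look like, in general, if I wanted to convert the data frame to a structured array instead?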
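The column lookup described above can be sketched as follows. This is only a minimal illustration: the sample frame and the name col_index are assumptions made for the example, and the dictionary is built with enumerate rather than the explicit loop.

```python
import numpy as np
import pandas as pd


def df2numpy(df):
    # Map each column label to its positional index in the value array.
    col_index = {name: pos for pos, name in enumerate(df.columns)}
    return df.values, df.index, col_index


# Hypothetical sample data, used only for this illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

values, index, col_index = df2numpy(df)

# Look up the position of column "b", then slice that column out of the 2-D array.
b_column = values[:, col_index["b"]]
print(b_column)
```

The drawback that motivates the question is visible here: the column names live in a separate dictionary, so the array and the dictionary have to be passed around together.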
Some useful input might be the following code (see When to use a numpy struct or a numpy record array?):
import numpy as np
a = np.array([['2018-04-01T15:30:00'],
              ['2018-04-01T15:31:00'],
              ['2018-04-01T15:32:00'],
              ['2018-04-01T15:33:00'],
              ['2018-04-01T15:34:00']], dtype='datetime64[s]')
c = np.array([0,1,2,3,4]).reshape(-1,1)
# create the compound dtype
dtype = np.dtype(dict(names=['date', 'val'], formats=[arr.dtype for arr in (a, c)]))
# create an empty structured array
struct = np.empty(a.shape[0], dtype=dtype)
# populate the structured array with the data from your column arrays
struct['date'], struct['val'] = a.T, c.T
print(struct)
# output:
# array([('2018-04-01T15:30:00', 0), ('2018-04-01T15:31:00', 1),
#        ('2018-04-01T15:32:00', 2), ('2018-04-01T15:33:00', 3),
#        ('2018-04-01T15:34:00', 4)],
#       dtype=[('date', '<M8[s]'), ('val', '<i8')])
Answer 0 (score: 1)
Converting a DataFrame to an ndarray
Here is a general-purpose function that converts a DataFrame to a structured ndarray:
import numpy as np
import pandas as pd
def frameToStruct(df):
    # convert dataframe to record array, then cast to structured array
    struct = df.to_records(index=False).view(type=np.ndarray, dtype=list(df.dtypes.items()))
    # return the struct and the row labels
    return struct, df.index.values
# example dataframe
df = pd.DataFrame(data=[[True, 1,2],[False, 10,20]], columns=['a','b','c'])
struct,rowlab = frameToStruct(df)
print(struct)
# output
# [( True, 1, 2) (False, 10, 20)]
print(rowlab)
# output
# [0 1]
# you don't need to keep track of columns separately, struct will do that for you
print(struct.dtype.names)
# output
# ('a', 'b', 'c')
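As a rough sketch of how the resulting structured array might be used, the snippet below repeats the same conversion on the example frame and then reads from it by field name. The variable names col_c, subset and true_rows are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[True, 1, 2], [False, 10, 20]], columns=['a', 'b', 'c'])

# Same conversion as in frameToStruct: record array first, then a plain ndarray view.
struct = df.to_records(index=False).view(type=np.ndarray, dtype=list(df.dtypes.items()))

# The result is a plain ndarray with a compound dtype, not a recarray.
print(type(struct) is np.ndarray)

# A single column is selected by field name; no separate column dictionary is needed.
col_c = struct['c']

# Several columns can be selected at once with a list of field names.
subset = struct[['a', 'c']]

# A boolean field can serve directly as a row mask.
true_rows = struct[struct['a']]

print(col_c)
print(subset.dtype.names)
print(true_rows['b'])
```

Because the field names are stored in struct.dtype.names, the array carries its own column labels; only the row labels have to be kept separately.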
A good reason to use a structured array rather than a record array is that column access on a structured array is much faster:
# rec is assumed here to be the record array for the same data,
# e.g. rec = df.to_records(index=False); each %%timeit below is its own cell
# access record array column by attribute
%%timeit
rec.c
# 4.64 µs ± 79.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# get record array column
%%timeit
rec['c']
# 3.66 µs ± 29.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# get structured array column
%%timeit
struct['c']
# 163 ns ± 4.39 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
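The timings above rely on IPython's %%timeit magic. A comparable measurement outside IPython could look roughly like the sketch below, which uses the standard-library timeit module. It assumes that rec is the record array produced by to_records for the example frame, and the loop count n is an arbitrary choice; the absolute numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[True, 1, 2], [False, 10, 20]], columns=['a', 'b', 'c'])

# Assumption: rec is the record array, struct the structured view of the same data.
rec = df.to_records(index=False)
struct = rec.view(type=np.ndarray, dtype=list(df.dtypes.items()))

n = 100_000  # number of repetitions per measurement (arbitrary)

# Total time in seconds for n column accesses of each kind.
t_rec_attr = timeit.timeit(lambda: rec.c, number=n)
t_rec_item = timeit.timeit(lambda: rec['c'], number=n)
t_struct = timeit.timeit(lambda: struct['c'], number=n)

# Report the average time per access in nanoseconds.
for label, total in [("rec.c", t_rec_attr), ("rec['c']", t_rec_item), ("struct['c']", t_struct)]:
    print(f"{label:12s} {total / n * 1e9:10.1f} ns per access")
```

All three expressions return the same column values, so the comparison isolates the cost of the access path itself.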
For more information, see this book.