Question

I have a pandas DataFrame named df. With df.dtypes I can print to the screen:

arrival_time      object
departure_time    object
drop_off_type      int64
extra             object
pickup_type        int64
stop_headsign     object
stop_id           object
stop_sequence      int64
trip_id           object
dtype: object

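I would like to save this information so that I can compare it with other data, type-cast elsewhere, and so on. I want to save it to a local file and recover it elsewhere, in another program where the data itself cannot go. But I cannot figure out how. Here are the results of various conversions: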

df.dtypes.to_dict()
{'arrival_time': dtype('O'),
 'departure_time': dtype('O'),
 'drop_off_type': dtype('int64'),
 'extra': dtype('O'),
 'pickup_type': dtype('int64'),
 'stop_headsign': dtype('O'),
 'stop_id': dtype('O'),
 'stop_sequence': dtype('int64'),
 'trip_id': dtype('O')}
----
df.dtypes.to_json()
'{"arrival_time":{"alignment":4,"byteorder":"|","descr":[["","|O"]],"flags":63,"isalignedstruct":false,"isnative":true,"kind":"O","name":"object","ndim":0,"num":17,"str":"|O"},"departure_time":{"alignment":4,"byteorder":"|","descr":[["","|O"]],"flags":63,"isalignedstruct":false,"isnative":true,"kind":"O","name":"object","ndim":0,"num":17,"str":"|O"},"drop_off_type":{"alignment":4,"byteorder":"=","descr":[["","<i8"]],"flags":0,"isalignedstruct":false,"isnative":true,"kind":"i","name":"int64","ndim":0,"num":9,"str":"<i8"},"extra":{"alignment":4,"byteorder":"|","descr":[["","|O"]],"flags":63,"isalignedstruct":false,"isnative":true,"kind":"O","name":"object","ndim":0,"num":17,"str":"|O"},"pickup_type":{"alignment":4,"byteorder":"=","descr":[["","<i8"]],"flags":0,"isalignedstruct":false,"isnative":true,"kind":"i","name":"int64","ndim":0,"num":9,"str":"<i8"},"stop_headsign":{"alignment":4,"byteorder":"|","descr":[["","|O"]],"flags":63,"isalignedstruct":false,"isnative":true,"kind":"O","name":"object","ndim":0,"num":17,"str":"|O"},"stop_id":{"alignment":4,"byteorder":"|","descr":[["","|O"]],"flags":63,"isalignedstruct":false,"isnative":true,"kind":"O","name":"object","ndim":0,"num":17,"str":"|O"},"stop_sequence":{"alignment":4,"byteorder":"=","descr":[["","<i8"]],"flags":0,"isalignedstruct":false,"isnative":true,"kind":"i","name":"int64","ndim":0,"num":9,"str":"<i8"},"trip_id":{"alignment":4,"byteorder":"|","descr":[["","|O"]],"flags":63,"isalignedstruct":false,"isnative":true,"kind":"O","name":"object","ndim":0,"num":17,"str":"|O"}}'
----
json.dumps( df.dtypes.to_dict() )
...
TypeError: dtype('O') is not JSON serializable

----
list(df.dtypes)
[dtype('O'),
 dtype('O'),
 dtype('int64'),
 dtype('O'),
 dtype('int64'),
 dtype('O'),
 dtype('O'),
 dtype('int64'),
 dtype('O')]

How can I save and export/archive the dtype information of a pandas DataFrame?

Answer 1

pd.DataFrame.dtypes returns a pd.Series object. This means you can manipulate it like any regular Series in pandas:

import pandas as pd

df = pd.DataFrame({'A': [''], 'B': [1.0], 'C': [1], 'D': [True]})

res = df.dtypes.to_frame('dtypes').reset_index()

print(res)

  index   dtypes
0     A   object
1     B  float64
2     C    int64
3     D     bool

Output to csv / excel / pickle

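You can then use any of the methods normally used to store a DataFrame, such as to_csv, to_excel, or to_pickle. Note that distributing pickle files is not recommended, since pickle is version dependent.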
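As an illustration of the CSV route, here is a minimal sketch that writes the column/dtype table to a file and rebuilds a mapping from it. The file name dtypes.csv and the variable names loaded and type_map are assumptions made for this example only.

```python
import pandas as pd

df = pd.DataFrame({'A': [''], 'B': [1.0], 'C': [1], 'D': [True]})
res = df.dtypes.to_frame('dtypes').reset_index()

# Write the column/dtype table; each dtype is written out as its name
res.to_csv('dtypes.csv', index=False)

# Read the table back and rebuild a column -> dtype-name mapping
loaded = pd.read_csv('dtypes.csv')
type_map = dict(zip(loaded['index'], loaded['dtypes']))

print(type_map)
```

The resulting mapping holds plain strings, so it can be compared with the dtypes of another DataFrame or passed to the dtype argument of read_csv.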

Output to json

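If you want an easy way to store and load the dtypes as a dictionary, a popular format is json. As you found, you need to convert to str type first: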

import json

# first create dictionary
d = res.set_index('index')['dtypes'].astype(str).to_dict()

with open('types.json', 'w') as f:
    json.dump(d, f)

with open('types.json', 'r') as f:
    data_types = json.load(f)

print(data_types)

{'A': 'object', 'B': 'float64', 'C': 'int64', 'D': 'bool'}
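Once loaded, the dictionary of type names can be handed back to pandas. The following is a minimal sketch, assuming a hypothetical second DataFrame named other whose numeric columns arrived as strings; astype accepts a column-to-dtype mapping, so the saved names can be applied directly.

```python
import json
import pandas as pd

# Original frame; save its dtypes as plain strings
df = pd.DataFrame({'A': [''], 'B': [1.0], 'C': [1], 'D': [True]})
with open('types.json', 'w') as f:
    json.dump(df.dtypes.astype(str).to_dict(), f)

# Elsewhere: load the dtype names again
with open('types.json', 'r') as f:
    data_types = json.load(f)

# 'other' is a hypothetical frame whose numeric columns came in as strings
other = pd.DataFrame({'A': ['x'], 'B': ['2.5'], 'C': ['3'], 'D': [False]})
restored = other.astype(data_types)

# Compare the dtype names of the converted frame with the saved ones
same = restored.dtypes.astype(str).to_dict() == data_types
print(same)
```

The same dictionary could also be passed as the dtype argument of read_csv when the data is loaded from a file.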

Answer 2

You can use the pickle format.

# save
df.to_pickle(file_name)

# load
df = pandas.read_pickle(file_name)

Here is the documentation.
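If only the dtype information needs to travel, and not the data, the same approach can be applied to the dtypes Series alone. A minimal sketch, assuming a hypothetical file name dtypes.pkl; the usual caveat applies that pickle files depend on the library versions used to write them.

```python
import pandas as pd

df = pd.DataFrame({'A': [''], 'B': [1.0], 'C': [1], 'D': [True]})

# save only the dtypes Series, not the data
df.dtypes.to_pickle('dtypes.pkl')

# load it back as a Series indexed by column name
loaded_dtypes = pd.read_pickle('dtypes.pkl')

print(loaded_dtypes)
```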

Answer 3

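I have found myself putting the dtype information as the first line of the CSV file. It is trivial to read it out before the DataFrame, which makes this quite nice.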

Example DataFrame (shamelessly copied from @jpp's answer):

df = pd.DataFrame({'A': [''], 'B': [1.0], 'C': [1], 'D': [True]})

To save, I would do this:

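with open('test.csv', 'wt') as f:
    # first line: one dtype per column, after an empty field for the index
    f.write(',' + ','.join(map(str, df.dtypes)) + '\n')
    # 'lineterminator' is the pandas >= 1.5 name; older versions use 'line_terminator'
    df.to_csv(f, lineterminator='\n')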

I add an extra comma here for the index column, since I want to write the index. In general, you do not have to do this.

Reading is now 4 lines instead of a single one, but arguably more precise.

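with open('test.csv', 'rt') as f:
    # first line holds the dtypes; drop the leading empty field for the index
    types = next(f).rstrip().split(',')[1:]
    # second line holds the column names, again skipping the index field
    columns = next(f).rstrip().split(',')[1:]
    # the remaining lines are the data, read with an explicit dtype per column
    test = pd.read_csv(f, dtype=dict(zip(columns, types)), index_col=0, names=columns)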

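I ran into this problem while doing catalog searches on astronomical data, where many text fields were missing and were incorrectly loaded as float NaN. An alternative would be to set low_memory=False on read_csv, but that makes the behavior more implicit rather than explicit.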
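The problem described above can be reproduced with a small example. This is a minimal sketch using made-up data: the column names name and flux and the inline CSV text are assumptions for illustration, not taken from an actual catalog.

```python
import io
import pandas as pd

# A text column ('name') whose values are all missing, plus a numeric column
csv_text = "name,flux\n,1.5\n,2.5\n"

# Without a dtype hint, the empty column is inferred as float64 (all NaN)
inferred = pd.read_csv(io.StringIO(csv_text))

# With an explicit dtype, the column keeps the requested type
explicit = pd.read_csv(io.StringIO(csv_text), dtype={'name': 'object'})

print(inferred.dtypes)
print(explicit.dtypes)
```

Storing the dtypes alongside the data makes the second form possible without guessing which columns were meant to hold text.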
