Question

I want to read the contents of an Excel document into a pandas DataFrame and then write that DataFrame to an HDF5 file. This is what I tried:

import pandas as pd

xls_df = pd.read_excel(fn_xls)
xls_df.to_hdf(fn_h5, key='table', format='table', mode='w')

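This results in the following error:

TypeError: Cannot serialize the column [Col1] because its data contents are [unicode] object dtype

I tried calling convert_objects() on the DataFrame read from the Excel file, but that did not help (and convert_objects() is deprecated anyway). Any suggestions?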

Here is some information about the DataFrame read from the Excel file:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 5 columns):
Col1                   101 non-null object
Col2                   101 non-null object
Col3                   94 non-null float64
Col4                   98 non-null object
Col5                   93 non-null float64
dtypes: float64(2), object(3)

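The first and second columns are strings, the fourth column contains one string but is otherwise integers, and the third and fifth columns are integers.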

Answer 1

Column "Col4" mixes strings and integers, and that mix of data types causes the error when writing to HDF5 in "table" format.
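To check which columns actually mix Python types, something like the following sketch may help. The sample frame and the helper name `python_types_per_column` are made up for illustration; only `Col1` and `Col4` echo the question.

```python
import pandas as pd

# Hypothetical sample mirroring the question: Col4 is mostly integers plus one string.
df = pd.DataFrame({
    "Col1": ["a", "b", "c"],
    "Col4": [1, "n/a", 3],
})

def python_types_per_column(frame):
    """Map each object-dtype column to the set of Python type names it holds."""
    result = {}
    for name in frame.select_dtypes(include="object").columns:
        # Ignore missing values so NaN does not count as an extra type.
        result[name] = set(frame[name].dropna().map(lambda v: type(v).__name__))
    return result

types_found = python_types_per_column(df)

# A column holding more than one Python type is a candidate for cleanup.
mixed_columns = [name for name, kinds in types_found.items() if len(kinds) > 1]
print(mixed_columns)
```

Any column that shows up in `mixed_columns` needs one of the conversions discussed in this answer before it can go into a "table" format store.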

To save in HDF5 "table" format, you need to convert the numbers in Col4 to floats (and the strings to NaN):

df["Col4"] = pd.to_numeric(df["Col4"], errors="coerce")
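Since `errors="coerce"` silently replaces anything non-numeric with NaN, it can be worth looking at what gets lost first. A small sketch on made-up data (the values are illustrative, not from the question's file):

```python
import pandas as pd

# Hypothetical column: integers plus a single stray string.
col = pd.Series([1, 2, "unknown", 4], name="Col4")

# Non-numeric entries become NaN, so the result is a float column.
numeric = pd.to_numeric(col, errors="coerce")

# Entries that were present before but are NaN after conversion were coerced away.
lost = col[numeric.isna() & col.notna()]

print(numeric.dtype)
print(lost.tolist())
```

If `lost` contains values you care about, converting the column to strings is probably the safer choice.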

Or convert everything in the column to strings:

df["Col4"] = df["Col4"].astype(str)
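One caveat, as far as I know: depending on the pandas version, `astype(str)` may turn missing values into the literal text "nan". If you would rather keep them as missing, a sketch like this converts only the values that are present (the sample data is made up):

```python
import pandas as pd

# Hypothetical column with integers, a string, and a missing value.
col = pd.Series([1, "n/a", None, 4], dtype=object, name="Col4")

# Convert only the values that are present; leave missing entries untouched.
as_text = col.map(lambda v: v if pd.isna(v) else str(v))

print(as_text.dropna().tolist())
print(int(as_text.isna().sum()))
```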

Or use the "fixed" HDF5 format, which allows columns with mixed data types. This stores the mixed-type column as a Python pickle and currently emits a PerformanceWarning.

df.to_hdf(outpath, key='yourkey', format='fixed', mode='w')
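Putting the "table" route together, here is a minimal sketch of the whole flow. The function names `prepare_for_table_format` and `excel_to_hdf` and the `numeric_columns` parameter are my own, chosen for illustration. Writing HDF5 requires PyTables, so the write function is only defined here, not called.

```python
import pandas as pd

def prepare_for_table_format(frame, numeric_columns):
    """Return a copy in which the listed columns are coerced to float (non-numbers become NaN)."""
    cleaned = frame.copy()
    for name in numeric_columns:
        cleaned[name] = pd.to_numeric(cleaned[name], errors="coerce")
    return cleaned

def excel_to_hdf(fn_xls, fn_h5, numeric_columns, key="table"):
    """Read an Excel sheet, clean the listed columns, and write HDF5 in table format.

    Assumes an Excel reader engine and PyTables are installed.
    """
    xls_df = pd.read_excel(fn_xls)
    cleaned = prepare_for_table_format(xls_df, numeric_columns)
    cleaned.to_hdf(fn_h5, key=key, format="table", mode="w")
    return cleaned

# The cleaning step can be tried on its own, without any files.
sample = pd.DataFrame({"Col1": ["a", "b"], "Col4": [1, "x"]})
prepared = prepare_for_table_format(sample, ["Col4"])
print(prepared.dtypes)
```

For the question's file, the call would presumably look like `excel_to_hdf(fn_xls, fn_h5, ["Col4"])`.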

Converting from Excel to HDF5 using Pandas
