Question

I would like to know whether there is a good way to save a pandas DataFrame to HDF when it contains string columns.

Given the DataFrame:

In [6]: df.head()
Out[6]:
   Protocol           Src   Bytes
10     ICMP           NaN    1062
11     ICMP     10.2.0.74    2146
12     ICMP  10.100.100.1  857520
13     ICMP  10.100.100.2  857520
14     ICMP  10.100.100.2    7000

df.to_hdf('save.h5', 'table') results in:

/home/lpuggini/MyApps/python_2_7_numerical/lib/python2.7/site-packages/pandas/core/generic.py:1138: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->['Protocol', 'Src']]

  return pytables.to_hdf(path_or_buf, key, self, **kwargs)

This message can be avoided by converting the column to str:

df['Src'] = df['Src'].apply(str)

But then np.nan is also saved as the string 'nan'.

Is there a better way to save a DataFrame whose columns contain both strings and np.nan?

Answer 1

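Columns in an HDF file must have a single dtype. NaN is represented internally by numpy as a float, so a column that mixes strings and NaN has mixed types. You can replace the NaN values with empty strings using: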

df['Src'] = df['Src'].fillna('')
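To make the difference between the two conversions concrete, here is a minimal sketch comparing apply(str) with fillna(''). The small frame is a hypothetical stand-in for the one in the question, built with object dtype to mimic a column that mixes strings and NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the frame in the question.
df = pd.DataFrame({
    "Protocol": pd.Series(["ICMP", "ICMP", "ICMP"], dtype=object),
    "Src": pd.Series([np.nan, "10.2.0.74", "10.100.100.1"], dtype=object),
    "Bytes": [1062, 2146, 857520],
})

# apply(str) stringifies every element, so NaN becomes the literal text 'nan'.
as_str = df["Src"].apply(str)

# fillna('') only touches the missing entries and leaves real strings alone.
filled = df["Src"].fillna("")

print(as_str.tolist())
print(filled.tolist())
```

Either way the column ends up holding only strings. The empty string is just easier to tell apart from a real value when the data is read back.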

HDF performs much better on numeric types than on strings, so it may make more sense to convert the IP addresses to an integer type.
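One possible way to do that conversion, sketched with the standard-library ipaddress module. The helper names ip_to_int and int_to_ip and the MISSING sentinel are assumptions made for illustration; they are not part of pandas or of the original answer.

```python
import ipaddress
import math

# Hypothetical sentinel for a missing address; real IPv4 values are >= 0.
MISSING = -1


def ip_to_int(value):
    """Convert a dotted IPv4 string to an int, mapping None/NaN to MISSING."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return MISSING
    return int(ipaddress.IPv4Address(value))


def int_to_ip(number):
    """Inverse of ip_to_int: MISSING becomes NaN again."""
    if number == MISSING:
        return float("nan")
    return str(ipaddress.IPv4Address(number))


srcs = [float("nan"), "10.2.0.74", "10.100.100.1"]
encoded = [ip_to_int(s) for s in srcs]
decoded = [int_to_ip(n) for n in encoded]
print(encoded)
```

On a DataFrame this could be applied as df['Src'].map(ip_to_int), which yields an integer column that PyTables can map directly to a C type.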

Edit: see @Jeff's comment below. The above applies to format='fixed'.

Edit 2: According to the docs, you can specify the on-disk representation of NaN for string dtype columns:

df.to_hdf((...), nan_rep='whatever you want')
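A minimal round-trip sketch of the nan_rep approach, assuming PyTables is installed and using format='table'. The frame, the temporary file path, and the '_missing_' marker are hypothetical choices for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for the frame in the question.
df = pd.DataFrame({
    "Protocol": pd.Series(["ICMP", "ICMP", "ICMP"], dtype=object),
    "Src": pd.Series([np.nan, "10.2.0.74", "10.100.100.1"], dtype=object),
    "Bytes": [1062, 2146, 857520],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "save.h5")
    # In table format, missing strings are written to disk as nan_rep.
    df.to_hdf(path, key="table", format="table", nan_rep="_missing_")
    restored = pd.read_hdf(path, key="table")

# Check which entries of the string column come back as missing.
print(restored["Src"].isna().tolist())
```

The marker should be a string that cannot occur as a real value in the column, since anything equal to it is read back as NaN.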
