Question

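I have CSV files encoded with Asian characters (let's say UTF-8).

When I try to convert a CSV to a Pandas HDFStore, I need to handle unicode and min_itemsize before appending to the store.

How can I find the maximum size of a DataFrame column that contains UTF-8 (Asian-character) strings?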

Edit: sample of the Asian text:

SMALL_AREA_NAME,PREF_NAME,COUPON_ID_hash

埼玉,埼玉県,6b263844241eea98c5a97f1335ea82af
新宿・高田馬場・中野・吉祥寺,東京都,e0a410ff611abefbfb57ca262dcdf42e
銀座・新橋・東京・上野,東京都,b286f6fb50a4f849e4382c9752405d7a

Edit 2: It seems unicode causes a problem with HDFStore append, which returns an error (Python 2.7; I cannot use Python 3 because of conflicts with other packages).

  for col in col_list :
     df_i[col] = df_i[col].map(lambda x:  x.encode('utf-8'))
     max_size= df_i[col].str.len().max() 

  store.append(tablename, df_i, format='table', encoding="utf-8", min_itemsize=max_size)

This returns the following error:

Traceback (most recent call last):
  File "D:\_devs\Python01\Anaconda27\lib\site-packages\IPython\core\interactiveshell.py", line 2885, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-69-e96ff71ee569>", line 26, in <module>
    store.append(tablename, df_i, format='table', encoding="utf-8", min_itemsize=max_size)
  File "D:\_devs\Python01\Anaconda27\lib\site-packages\pandas\io\pytables.py", line 919, in append
    **kwargs)
  File "D:\_devs\Python01\Anaconda27\lib\site-packages\pandas\io\pytables.py", line 1264, in _write_to_group
    s.write(obj=value, append=append, complib=complib, **kwargs)
  File "D:\_devs\Python01\Anaconda27\lib\site-packages\pandas\io\pytables.py", line 3787, in write
    **kwargs)
  File "D:\_devs\Python01\Anaconda27\lib\site-packages\pandas\io\pytables.py", line 3460, in create_axes
    raise e
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
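
A note on the message itself: the byte 0xe5 at position 0 matches the first UTF-8 byte of 埼, the first character of the sample data. The sketch below only illustrates where that byte comes from and what an ASCII decode of the encoded value does. It is not a claim about the exact code path inside pandas, and the variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Illustration only: shows where byte 0xe5 comes from, not what pandas does internally.
encoded = u"埼玉".encode("utf-8")    # UTF-8 bytes of the first data value
first_byte = bytearray(encoded)[0]   # bytearray indexing gives an int on Python 2 and 3

try:
    encoded.decode("ascii")          # an ASCII decode of non-ASCII bytes fails
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(hex(first_byte))
print(error_message)
```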

Answer 1

Update: tested under Python 2.7

Python 2.7.12 |Anaconda 4.2.0 (64-bit)| (default, Jun 29 2016, 11:07:13) [MSC v.1500 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.

IPython 5.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: df = pd.read_clipboard()

In [2]: df
Out[2]:
   a       b
0  1      hi
1  2  привіт
2  3   Grüßi

In [3]: store = pd.HDFStore('d:/temp/test_py27.h5')

In [4]: store.append('test', df)

In [5]: store.get_storer('test').table
Out[5]:
/test/table (Table(3,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Int64Col(shape=(1,), dflt=0, pos=1),
  "values_block_1": StringCol(itemsize=12, shape=(1,), dflt='', pos=2)}
  byteorder := 'little'
  chunkshape := (2340,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}

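The itemsize=12 in the table description above is consistent with the strings being measured in UTF-8 bytes rather than characters: привіт has 6 characters but takes 12 bytes. A minimal check, assuming the three values are stored as UTF-8 (the list below simply retypes column b from the demo):

```python
# -*- coding: utf-8 -*-
# Column b from the demo, retyped by hand as unicode literals.
values = [u"hi", u"привіт", u"Grüßi"]

char_lengths = [len(v) for v in values]                  # characters per value
byte_lengths = [len(v.encode("utf-8")) for v in values]  # UTF-8 bytes per value

print(char_lengths)
print(byte_lengths)
print(max(byte_lengths))
```
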
Old answer:

You can use Series.str.len().max():

Demo:

In [91]: df
Out[91]:
                    A
0         aaa.bbbbbbb
1  ccc,xxxxxxxxxxxxxx
2           xxxxx.zzz

In [92]: df.A.str.len()
Out[92]:
0    11
1    18
2     9
Name: A, dtype: int64

In [93]: df.A.str.len().max()
Out[93]: 18
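
One caveat for non-ASCII data: on unicode strings a character count and a UTF-8 byte count differ, and the demo above uses ASCII only, where the two are the same. Here is a rough sketch using the SMALL_AREA_NAME values from the question. The helper names are hypothetical, and plain lists stand in for a DataFrame column (any iterable of unicode strings should work the same way).

```python
# -*- coding: utf-8 -*-
def max_char_len(values):
    """Longest value, counted in characters."""
    return max(len(v) for v in values)

def max_utf8_len(values):
    """Longest value, counted in bytes after UTF-8 encoding."""
    return max(len(v.encode("utf-8")) for v in values)

# SMALL_AREA_NAME values copied from the question's sample rows.
small_area_names = [
    u"埼玉",
    u"新宿・高田馬場・中野・吉祥寺",
    u"銀座・新橋・東京・上野",
]

char_max = max_char_len(small_area_names)
byte_max = max_utf8_len(small_area_names)
print(char_max, byte_max)
```

If the column is stored as UTF-8 bytes, the byte count is the safer number to size against.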

Answer 2

For the record, this works in Python 2.7 (with the final encoding argument removed):

  for col in col_list :
     df_i[col] = df_i[col].map(lambda x:  str(x.encode('utf-8')))
     max_size= df_i[col].str.len().max() 

  store.append(tablename, df_i, format='table', min_itemsize=max_size)
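
One thing to watch in the loop above: max_size is reassigned on every pass, so the value handed to store.append is the maximum of the last column only. If an earlier column holds longer strings, that value may be too small. A minimal sketch of tracking the maximum per column and overall, using the question's sample rows; the dict of lists is a stand-in for df_i, and the names are illustrative only:

```python
# -*- coding: utf-8 -*-
# Stand-in for df_i: column name -> values, copied from the question's sample.
columns = {
    "SMALL_AREA_NAME": [u"埼玉", u"新宿・高田馬場・中野・吉祥寺", u"銀座・新橋・東京・上野"],
    "PREF_NAME": [u"埼玉県", u"東京都", u"東京都"],
    "COUPON_ID_hash": [
        u"6b263844241eea98c5a97f1335ea82af",
        u"e0a410ff611abefbfb57ca262dcdf42e",
        u"b286f6fb50a4f849e4382c9752405d7a",
    ],
}

# Longest UTF-8 encoded value in each column, in bytes.
per_column = {
    name: max(len(v.encode("utf-8")) for v in values)
    for name, values in columns.items()
}

# A single integer that is large enough for every string column.
overall_max = max(per_column.values())
print(per_column)
print(overall_max)
```

The overall value is what you would pass as the integer min_itemsize in the call above.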

Calculating min_itemsize for Pandas HDFStore string columns before appending
