I have CSV files encoded with Asian characters (let's say UTF-8).
When converting the CSV to a Pandas HDFStore, I need to handle unicode and min_itemsize before appending.
How can I find the maximum size of a DataFrame column that contains UTF-8 (Asian character) strings?
Edit: sample of the Asian text:
SMALL_AREA_NAME,PREF_NAME,COUPON_ID_hash
埼玉,埼玉県,6b263844241eea98c5a97f1335ea82af
新宿・高田馬場・中野・吉祥寺,東京都,e0a410ff611abefbfb57ca262dcdf42e
銀座・新橋・東京・上野,東京都,b286f6fb50a4f849e4382c9752405d7a
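Note that PyTables sizes string columns in bytes, not characters, so for Asian text the character count and the UTF-8 byte count of a column differ. A minimal sketch of measuring both on the sample above (the file name coupon_areas.csv is hypothetical; the column name is taken from the sample):

# -*- coding: utf-8 -*-
import pandas as pd

# Load the sample shown above (the path is hypothetical)
df = pd.read_csv('coupon_areas.csv', encoding='utf-8')

# Character count vs. UTF-8 byte count of the widest value in the column
char_max = df['SMALL_AREA_NAME'].str.len().max()
byte_max = df['SMALL_AREA_NAME'].map(lambda x: len(x.encode('utf-8'))).max()
print('%d chars, %d bytes' % (char_max, byte_max))
# e.g. the longest sample value is 14 characters but 42 bytes in UTF-8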
Edit 2: It seems unicode causes a problem when appending to the HDFStore, returning an error (Python 2.7; Python 3 is not usable because of conflicts with other packages).
for col in col_list:
    df_i[col] = df_i[col].map(lambda x: x.encode('utf-8'))
    max_size = df_i[col].str.len().max()

store.append(tablename, df_i, format='table', encoding="utf-8", min_itemsize=max_size)
This returns the following error:
Traceback (most recent call last):
File "D:\_devs\Python01\Anaconda27\lib\site-packages\IPython\core\interactiveshell.py", line 2885, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-69-e96ff71ee569>", line 26, in <module>
store.append(tablename, df_i, format='table', encoding="utf-8", min_itemsize=max_size)
File "D:\_devs\Python01\Anaconda27\lib\site-packages\pandas\io\pytables.py", line 919, in append
**kwargs)
File "D:\_devs\Python01\Anaconda27\lib\site-packages\pandas\io\pytables.py", line 1264, in _write_to_group
s.write(obj=value, append=append, complib=complib, **kwargs)
File "D:\_devs\Python01\Anaconda27\lib\site-packages\pandas\io\pytables.py", line 3787, in write
**kwargs)
File "D:\_devs\Python01\Anaconda27\lib\site-packages\pandas\io\pytables.py", line 3460, in create_axes
raise e
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
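A likely cause, judging from the traceback (my reading, not confirmed in the original post): on Python 2, calling .encode('utf-8') on a string that is already a UTF-8 byte string makes the interpreter first decode it with the default ascii codec, so pre-encoding the columns and then also passing encoding="utf-8" to append can raise exactly this error. A minimal Python 2.7 reproduction:

# -*- coding: utf-8 -*-
s = u'埼玉'.encode('utf-8')  # s is now a byte string (str) containing UTF-8 data
s.encode('utf-8')            # Python 2 implicitly decodes with 'ascii' first
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)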
Answer 0 (score: 1)
UPDATE: tested under Python 2.7
Python 2.7.12 |Anaconda 4.2.0 (64-bit)| (default, Jun 29 2016, 11:07:13) [MSC v.1500 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.
IPython 5.1.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: df = pd.read_clipboard()
In [2]: df
Out[2]:
a b
0 1 hi
1 2 привіт
2 3 Grüßi
In [3]: store = pd.HDFStore('d:/temp/test_py27.h5')
In [4]: store.append('test', df)
In [5]: store.get_storer('test').table
Out[5]:
/test/table (Table(3,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Int64Col(shape=(1,), dflt=0, pos=1),
"values_block_1": StringCol(itemsize=12, shape=(1,), dflt='', pos=2)}
byteorder := 'little'
chunkshape := (2340,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
Old answer:
You can use Series.str.len().max():
Demo:
In [91]: df
Out[91]:
A
0 aaa.bbbbbbb
1 ccc,xxxxxxxxxxxxxx
2 xxxxx.zzz
In [92]: df.A.str.len()
Out[92]:
0 11
1 18
2 9
Name: A, dtype: int64
In [93]: df.A.str.len().max()
Out[93]: 18
Answer 1 (score: 0)
For the record, this works in Python 2.7 (with the encoding argument removed from the append call):
for col in col_list:
    df_i[col] = df_i[col].map(lambda x: str(x.encode('utf-8')))
    max_size = df_i[col].str.len().max()

store.append(tablename, df_i, format='table', min_itemsize=max_size)
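One caveat with the loop above: only the max_size of the last column reaches the append call. min_itemsize also accepts a dict keyed by column name, so a variation that sizes each string column separately might look like this (a sketch under the same Python 2.7 assumptions; variable names are taken from the snippet above):

# Encode each string column to UTF-8 bytes and record its maximum byte length
sizes = {}
for col in col_list:
    df_i[col] = df_i[col].map(lambda x: x.encode('utf-8'))
    sizes[col] = int(df_i[col].str.len().max())

# One minimum itemsize per column instead of a single scalar
store.append(tablename, df_i, format='table', min_itemsize=sizes)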