我从这个文件中读取了unicode数据:
Mdt,Doccompra,OrgC,Cen,NumP,Criadopor,Dtcriacao,Fornecedor,P,Fun
400,8751215432,2581,,1,MIGRAÇÃO,01.10.2004,75852214,,TD
400,5464282154,9874,,1,MIGRAÇÃO,01.10.2004,78995411,,FO
我有两个问题:
1)当我尝试查询此unicode数据时,我得到UnicodeDecodeError
:
Traceback (most recent call last):
File "<ipython-input-1-4423dceb2b1d>", line 1, in <module>
runfile('C:/Users/u5en/Documents/SAP/Programação/Problema HDF.py', wdir='C:/Users/u5en/Documents/SAP/Programação')
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 580, in runfile
execfile(filename, namespace)
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 48, in execfile
exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)
File "C:/Users/u5en/Documents/SAP/Programação/Problema HDF.py", line 15, in <module>
store.select("EKKA", "columns=['Mdt', 'Fornecedor']")
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 665, in select
return it.get_result()
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 1359, in get_result
results = self.func(self.start, self.stop, where)
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 658, in func
columns=columns, **kwargs)
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 3968, in read
if not self.read_axes(where=where, **kwargs):
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 3201, in read_axes
a.convert(values, nan_rep=self.nan_rep, encoding=self.encoding)
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 2058, in convert
self.data, nan_rep=nan_rep, encoding=encoding)
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 4359, in _unconvert_string_array
data = f(data)
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\numpy\lib\function_base.py", line 1700, in __call__
return self._vectorize_call(func=func, args=vargs)
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\numpy\lib\function_base.py", line 1769, in _vectorize_call
outputs = ufunc(*inputs)
File "C:\Users\u5en\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 4358, in <lambda>
f = np.vectorize(lambda x: x.decode(encoding), otypes=[np.object])
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 7: unexpected end of data
如何在hdf5中存储和查询我的unicode数据?
2)我有许多表格,我们事先并不知道列名,而且不是propper pytable名称(NaturalNameWarning)。我希望用户能够查询这些列,所以我想知道如果他们的名字阻止我,我怎么能查询这些。我认为这曾经有no easy fix,所以如果仍然如此,我只会从标题中删除有问题的字符。
import csv
import pandas as pd
dados = pd.read_csv("EKKA - Cópia.csv")
print(dados)
store= pd.HDFStore('teste.h5' , encoding="utf-8")
store.append("EKKA", dados, format="table", data_columns=True)
store.select("EKKA", "columns=['Mdt', 'Fornecedor']")
store.close()
sqlite
环境:
Windows 7 64位
熊猫15.2
NumPy 1.9.2
答案 0 :(得分:1)
因此,在Windows 7上的Python 2.7,pandas 0.15.2中,一切都按预期工作,无需编码。但是在Python 3.4上,以下内容对我有用。显然,某些字符在'utf-8&#39;中不可表示; &#39; LATIN1&#39;编码通常可以解决这些问题。请注意,我必须首先使用此编码读取csv。
>>> df = pd.read_csv('../../test.csv',encoding='latin1')
>>> df
Mdt Doccompra OrgC Cen NumP Criadopor Dtcriacao Fornecedor P Fun
0 400 8751215432 2581 NaN 1 MIGRAÇ\xc3O 01.10.2004 75852214 NaN TD
1 400 5464282154 9874 NaN 1 MIGRAÇ\xc3O 01.10.2004 78995411 NaN FO
此外,必须在打开商店时指定编码,而不是在append/put
调用
>>> df.to_hdf('test.h5','df',format='table',mode='w',data_columns=True,encoding='latin1')
>>> pd.read_hdf('test.h5','df')
Mdt Doccompra OrgC Cen NumP Criadopor Dtcriacao Fornecedor P Fun
0 400 8751215432 2581 NaN 1 MIGRAÇ\xc3O 01.10.2004 75852214 NaN TD
1 400 5464282154 9874 NaN 1 MIGRAÇ\xc3O 01.10.2004 78995411 NaN FO
编码后,无需在阅读时指定编码。