Question

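I am trying to select rows by specifying a value for one of the columns. This works fine as long as the value is pure ASCII. If it contains non-ASCII characters, however, I cannot get it to work no matter how I encode the value.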

A simplified example that illustrates the problem:

>>> from __future__ import (absolute_import, division,
...                         print_function, unicode_literals)
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 'Stuttgart'], [2, 'München']], columns=['id', 'city'])
>>> df['city'] = df['city'].map(lambda x: x.encode('latin-1'))
>>> store = pd.HDFStore('test_store.h5')
>>> store.append('test_key', df, data_columns=True)
>>> store['test_key']
   id       city
0   1  Stuttgart
1   2    M�nchen

Note that the non-ASCII string was indeed stored correctly:

>>> store['test_key']['city'][1]
'M\xfcnchen'

Selecting an ASCII value works fine:

>>> store.select('test_key', where='city==%r' % 'Stuttgart')
   id       city
0   1  Stuttgart

But selecting a non-ASCII value returns no rows:

>>> store.select('test_key', where='city==%r' % 'München')
Empty DataFrame
Columns: [id, city]
Index: []

>>> store.select('test_key', where='city==%r' % 'München'.encode('latin-1'))
Empty DataFrame
Columns: [id, city]
Index: []

Obviously I am doing something wrong... How can I fix this?

Answer 1

Oddly, if the encoding is utf-8 instead of latin-1, the selection seems to work fine:

from __future__ import (absolute_import, division, 
                        print_function, unicode_literals)

import pandas as pd

df = pd.DataFrame([[1, 'Stuttgart'], [2, 'München']], columns=['id', 'city'])
df['city'] = df['city'].map(lambda x: x.encode('utf-8'))
store = pd.HDFStore('/tmp/test_store.h5', 'w')
store.append('test_key', df, data_columns=True)
print(store.select('test_key', where='city==%r' % 'Stuttgart'.encode('utf-8')))
#    id       city
# 0   1  Stuttgart

print(store.select('test_key', where='city==%r' % 'München'.encode('utf-8')))
#    id     city
# 1   2  München

store.close()
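
To see what actually ends up in the where string, it can help to print the two encodings side by side. The following is a minimal sketch for Python 3 (the code above targets Python 2 with unicode_literals, where the repr of a byte string has no b prefix); the variable names are illustrative only. It shows how the two encodings differ, but it does not by itself explain why the latin-1 comparison fails.

```python
# Compare the byte sequences produced by the two encodings (Python 3).
city = 'München'

latin1_bytes = city.encode('latin-1')  # 'ü' becomes the single byte 0xfc
utf8_bytes = city.encode('utf-8')      # 'ü' becomes the two bytes 0xc3 0xbc

print(repr(latin1_bytes))  # b'M\xfcnchen'
print(repr(utf8_bytes))    # b'M\xc3\xbcnchen'

# The where strings built with %r, as in the examples above.
where_latin1 = 'city==%r' % latin1_bytes
where_utf8 = 'city==%r' % utf8_bytes
print(where_latin1)
print(where_utf8)

# Both byte strings decode back to the same text when the matching codec is used.
assert latin1_bytes.decode('latin-1') == utf8_bytes.decode('utf-8') == city
```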

Answer 2

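It looks like PyTables 3.1.1 may not support unicode columns. I am not a PyTables user, but a bug report indicates that this is a known issue that was deferred to version 3.2. Another issue may be related.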
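
Since the behaviour may depend on the PyTables version, it could be worth checking which one is installed. A minimal sketch, assuming Python 3.8+ (for importlib.metadata) and that PyTables is installed under its PyPI distribution name tables. The helpers version_tuple and installed_pytables_version are hypothetical names, and 3.2 is only the version the bug report was deferred to, not a confirmed fix.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn a string such as '3.1.1' into (3, 1, 1), ignoring any suffix."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def installed_pytables_version():
    """Return the installed PyTables version string, or None if it is absent."""
    try:
        return metadata.version("tables")
    except metadata.PackageNotFoundError:
        return None


found = installed_pytables_version()
if found is None:
    print("PyTables does not appear to be installed")
elif version_tuple(found) < (3, 2):
    # Assumption: versions before 3.2 may be affected, per the bug report.
    print("PyTables %s predates 3.2; unicode columns may not be supported" % found)
else:
    print("PyTables %s is 3.2 or later" % found)
```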

Pandas: selecting strings with unicode characters
