Question

I have Unicode data in a Hive table, and I am trying to view it with PySpark.

sqlContext.table('mytable').select("column_containing_unicode_data").show(1)

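This returns an error:

UnicodeEncodeError: 'latin-1' codec can't encode characters in position 75-76: ordinal not in range(256)

Any suggestions on how I can read this kind of data? I am guessing I need to change the default encoding. I tried issuing: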

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

But it had no effect.
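For reference, the error message quoted above can be reproduced in plain Python without Spark. This is only a minimal sketch: the sample string is made up for illustration and is not taken from the table in the question. It shows that latin-1 rejects any character above code point 255, while UTF-8 encodes the same text without complaint.

```python
# Hypothetical sample text: two CJK characters, both above code point 255.
text = u"\u4e2d\u6587"

try:
    text.encode("latin-1")
    latin1_error = None
except UnicodeEncodeError as exc:
    # latin-1 only covers code points 0-255, so this branch is taken.
    latin1_error = exc

# UTF-8 can represent every Unicode code point, so this succeeds.
utf8_bytes = text.encode("utf-8")

print(repr(latin1_error))
print(repr(utf8_bytes))
```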

Answer 1

Figured it out:

from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

# Encode each value as UTF-8 before it is displayed (Python 2).
f = udf(lambda x: x.encode('utf-8'), StringType())
sqlContext.table('mytable').select(f("column_containing_unicode_data")).show()
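One caveat with the lambda above, offered as a sketch rather than as part of the original answer: if the column contains NULLs, the UDF receives None, and calling .encode on None raises AttributeError. A None-safe helper might look like the following. The name to_utf8 is hypothetical, and the function is written as plain Python so it can be tried without a Spark session.

```python
def to_utf8(value):
    """Encode a Unicode string as UTF-8, passing None through unchanged."""
    if value is None:
        # Hive NULLs reach a Python UDF as None; keep them as None.
        return None
    return value.encode("utf-8")


print(repr(to_utf8(u"caf\u00e9")))
print(repr(to_utf8(None)))
```

Under the same Python 2 setup as in this thread, it could presumably be wrapped as udf(to_utf8, StringType()) in place of the lambda.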
