Question

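I imported data from MySQL into HDFS as a Parquet file and built a Hive external table on top of it, but several unwanted control characters in the file were also loaded into the Hive table. I need to replace them with empty strings. I tried Pig, but had no luck. Below is the Spark code that returns the error.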

PySpark code:

from pyspark.sql import SQLContext

sc = spark.sparkContext
# Use SQLContext to read the Parquet file
sqlContext = SQLContext(sc)
df = sqlContext.read.parquet('path-to-file/file.parquet')
# Attempt to replace the unwanted character with an empty string
df1 = df.replace(['\xa0'], [''])
df1.write.parquet('path-to-file/replaced_files')

Issue:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 0: invalid start byte
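
For reference, the message is consistent with the literal '\xa0' being handled as a single raw byte (as a plain string literal would be under Python 2) and then decoded as UTF-8. That is an assumption about the environment, not something the post confirms. The short sketch below only illustrates the byte-level behaviour in plain Python; the variable names are illustrative.

```python
# A lone 0xA0 byte is not valid UTF-8, which matches the reported error text.
raw = b"\xa0"
try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as err:
    decoded_ok = False
    reason = err.reason  # short description of why decoding failed

# The character U+00A0 (no-break space) is two bytes in UTF-8 ...
nbsp = u"\u00a0"
utf8_bytes = nbsp.encode("utf-8")

# ... while in Latin-1 the single byte 0xA0 maps directly to U+00A0.
latin1_char = raw.decode("latin-1")
```

Under that assumption, writing the character as a Unicode literal (u'\xa0') rather than a byte string would avoid the decode step.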

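Please suggest how I can solve this in Spark, and let me know whether these control characters can be handled with Pig or in any other way.

Thanks in advance.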

Answer 1

If you are using Sqoop, use the --query option in the import command and replace \xa0 with the replace statement below; \xa0 is char(160) in the Unicode character set.

replace(input_string, char(160), ' ')
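
If the data is already in HDFS as Parquet, a Spark-side cleanup is another possibility. DataFrame.replace matches whole cell values rather than substrings, so the sketch below uses regexp_replace instead. It assumes Python 3 and a working PySpark installation; UNWANTED, clean_text and clean_string_columns are hypothetical names, and the character class is only an assumption about what counts as "unwanted" (it also strips tab and newline, so narrow it if those should be kept).

```python
import re

# Assumed definition of "unwanted": ASCII control characters plus U+00A0.
# Kept as a raw string with regex escapes so the same pattern text is valid
# for Python's re module and for the Java regex engine that Spark uses.
UNWANTED = r"[\x00-\x1F\x7F\u00A0]"


def clean_text(value, pattern=UNWANTED):
    """Plain-Python equivalent, handy for checking the pattern locally."""
    return re.sub(pattern, "", value)


def clean_string_columns(df, pattern=UNWANTED):
    """Return a DataFrame with the pattern removed from every string column."""
    # Imported here so the pattern helper above works without PySpark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    cols = []
    for field in df.schema.fields:
        if isinstance(field.dataType, StringType):
            # Replace every match inside the value with an empty string.
            cleaned = F.regexp_replace(F.col(field.name), pattern, "")
            cols.append(cleaned.alias(field.name))
        else:
            # Non-string columns are passed through unchanged.
            cols.append(F.col(field.name))
    return df.select(*cols)
```

With the DataFrame from the question, this would be called as clean_string_columns(df) and the result written out with write.parquet as before.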
