How do I read non-UTF-8 files in AWS Glue PySpark?

Time: 2018-11-16 03:22:14

Tags: utf-8 character-encoding pyspark etl aws-glue

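I have a Glue schema that works for most CSVs. Sometimes, though, customers upload CSVs with CP1252-encoded fields. When I try to do anything with the dynamicFrame built from such a file, for example datasource0.toDF().show(1), I get this error: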

An error occurred while calling o221.toDF.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 5.0 failed 4 times, most recent failure: Lost task 1.3 in stage 5.0 (TID 17, ip-172-31-11-73.ec2.internal, executor 2): com.amazonaws.services.glue.util.FatalException: Unable to parse file: xxxx11142018.csv
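
To illustrate why a CP1252 file can trip a parser that expects UTF-8, here is a minimal sketch in plain Python 3, independent of Glue. It assumes the parse failure comes from bytes that are valid CP1252 but not valid UTF-8; the sample text and the helper name `decodes_as` are made up for illustration.

```python
# Hypothetical sample: "café" encoded as CP1252, where é is the single byte 0xE9.
cp1252_bytes = "caf\u00e9".encode("cp1252")


def decodes_as(raw, encoding):
    """Return True if the raw bytes decode cleanly with the given encoding."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# In UTF-8, 0xE9 announces a multi-byte sequence that never completes here.
utf8_ok = decodes_as(cp1252_bytes, "utf-8")
cp1252_ok = decodes_as(cp1252_bytes, "cp1252")

# Decoding with the real source encoding and re-encoding yields valid UTF-8.
repaired = cp1252_bytes.decode("cp1252").encode("utf-8")
```

A quick check like this on a few raw bytes of a rejected file may help confirm that encoding, rather than the CSV layout, is what the parser objects to.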


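import re

def sanitizeString(rawValue):
    # Collapse runs of whitespace and drop backslashes, then re-encode
    # the CP1252 bytes as UTF-8 (rawValue is a Python 2 str here).
    value = re.sub('\\s+', ' ', rawValue)
    value = re.sub('\\\\', '', value)
    value = value.decode('CP1252').encode('utf-8')
    return value

def sanitizeLine(rawLine):
    line = rawLine.map(sanitizeString)
    return line  # without this return, the mapped result would be None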

datasourceRaw = glueContext.create_dynamic_frame.from_catalog(database = "glue_db", table_name = "glue_table", transformation_ctx = "datasourceRaw")
datasource0 = datasourceRaw.map(sanitizeLine)
# datasource3.toDF().show(1)



rdd0 = sc.textFile(filepath, use_unicode=False)
rdd0 = rdd0.map(sanitizeLine) # have to add an extra line to sanitizeLine() to split() the line
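
The extra splitting step mentioned in the comment above could look roughly like the following sketch. It is written for Python 3, assumes each RDD element arrives as raw bytes (which is what `use_unicode=False` is meant to give), and uses a hypothetical name, `sanitize_line`, rather than the functions from the question. It relies on the standard `csv` module instead of a plain `split(',')` so that quoted fields containing commas survive.

```python
import csv
import re


def sanitize_line(raw_line, source_encoding="cp1252"):
    """Decode one raw CSV line and return its cleaned fields as a list."""
    # Decode first so the regex below works on text rather than raw bytes.
    text = raw_line.decode(source_encoding)
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    text = text.replace("\\", "")     # drop backslashes
    # csv.reader respects quoted fields, unlike a plain split(",").
    return next(csv.reader([text.strip()]))


# Made-up sample line containing the CP1252 bytes 0xF6 (ö) and 0xE9 (é).
fields = sanitize_line(b'1,"Smith, J\xf6rg",caf\xe9  bar\\baz\r\n')
```

In a job this would be passed to the RDD as `rdd0.map(sanitize_line)`. Note that it treats every physical line as one record, so quoted fields that span several lines would need different handling.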


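  1. For whatever reason, reading the file as an RDD is slower than using a DynamicFrame.
  2. It gives me the whole line as a single string to map over. I suppose I could split on `,`, but then I want to convert the result back into a DynamicFrame, and for that I need a schema. I tried to get it from the catalog, but (you guessed it) I get the same error: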


datasourceRaw = glueContext.create_dynamic_frame.from_catalog(database = "glue_db", table_name = "glue_table", transformation_ctx = "datasourceRaw")
print(datasourceRaw.schema())

An error occurred while calling o434.schema.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 9.0 failed 4 times, most recent failure: Lost task 1.3 in stage 9.0 (TID 47, ip-172-31-11-73.ec2.internal, executor 5): com.amazonaws.services.glue.util.FatalException: Unable to parse file: xxxx_11142018.csv
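
Since the catalog lookup fails on the same file, one way to avoid it is to take the column names from the file itself. The sketch below is plain Python and assumes the CSV has a header row and that every column can be treated as a string; `rows_to_records` is a hypothetical helper, not part of Glue or Spark.

```python
def rows_to_records(header, rows):
    """Pair each row's fields with the column names taken from the header row."""
    records = []
    for fields in rows:
        # Trim surplus fields and pad short rows with None so that
        # every record ends up with exactly the header's keys.
        padded = list(fields[:len(header)]) + [None] * (len(header) - len(fields))
        records.append(dict(zip(header, padded)))
    return records


# Made-up sample: a header plus one complete row and one short row.
records = rows_to_records(["id", "name"], [["1", "J\u00f6rg"], ["2"]])
```

In a Glue job, the same column names could feed an explicit all-string schema for `spark.createDataFrame`, and the resulting DataFrame could then be wrapped with `DynamicFrame.fromDF`. That part is untested here and is only an outline of the idea.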

