I have a Glue setup that works for most CSVs. But sometimes a customer uploads a CSV with CP1252-encoded fields. When I try to do anything with the DynamicFrame built from such a file, I get an error:
datasource0.toDF().show(1)
An error occurred while calling o221.toDF.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 5.0 failed 4 times, most recent failure: Lost task 1.3 in stage 5.0 (TID 17, ip-172-31-11-73.ec2.internal, executor 2): com.amazonaws.services.glue.util.FatalException: Unable to parse file: xxxx11142018.csv
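For context on why this fails: the `FatalException` is raised inside Glue's own reader while it parses the file, because CP1252 bytes such as curly quotes are not valid UTF-8. One workaround is therefore to transcode the object to UTF-8 *before* Glue ever reads it. A minimal sketch of the transcoding step (pure Python; pushing the result back to S3 with `boto3`'s `get_object`/`put_object` would be a separate, environment-specific step):

```python
def transcode_cp1252_to_utf8(raw):
    # Decode the raw bytes as CP1252 and re-encode as UTF-8.
    # errors="replace" substitutes U+FFFD for the few bytes CP1252
    # leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), so one bad
    # byte cannot kill the whole file.
    return raw.decode("cp1252", errors="replace").encode("utf-8")


# CP1252 0xE9 is "é"; as UTF-8 it becomes the two bytes 0xC3 0xA9.
print(transcode_cp1252_to_utf8(b"caf\xe9"))
```

Running this once over the uploaded object and writing the UTF-8 result back (or to a staging prefix the crawler points at) sidesteps the parser entirely.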
I tried decoding/encoding each row, but it doesn't work:
import re

def sanitizeString(rawValue):
    value = re.sub(r'\s+', ' ', rawValue)
    value = re.sub(r'\\', '', value)
    value = value.decode('CP1252').encode('utf-8')
    return value

def sanitizeLine(rawLine):
    # rawLine is a DynamicRecord (dict-like), not a string, so sanitize each field
    return dict((k, sanitizeString(v)) for k, v in rawLine.items())

datasourceRaw = glueContext.create_dynamic_frame.from_catalog(database = "glue_db", table_name = "glue_table", transformation_ctx = "datasourceRaw")
datasource0 = datasourceRaw.map(sanitizeLine)
# datasource0.toDF().show(1)
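A map like this cannot fix the error even in principle: `DynamicFrame.map` runs *after* Glue's reader has parsed the CSV, and it is the parse itself that throws. For completeness, here is what a record-level sanitizer would look like in Python 3 if the file did parse (a sketch; the `\s+`/backslash rules are taken from the attempt above, and the record is assumed to be a plain dict of strings):

```python
import re

def sanitize_string(raw_value):
    # Collapse runs of whitespace to one space, then drop backslashes.
    value = re.sub(r'\s+', ' ', raw_value)
    return re.sub(r'\\', '', value)

def sanitize_record(record):
    # Glue passes one record (dict-like) per call, not one line of text,
    # so apply the string cleanup to each string-valued field.
    return {k: sanitize_string(v) if isinstance(v, str) else v
            for k, v in record.items()}

print(sanitize_record({"a": "x\\  y", "n": 3}))
```

In Python 3 there is also no `str.decode`, so the CP1252 handling has to happen on raw bytes before parsing, as discussed above.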
It gives exactly the same error.
I also tried loading it via an RDD:

rdd0 = sc.textFile(filepath, use_unicode=False)
rdd0 = rdd0.map(sanitizeLine)  # have to add an extra line to sanitizeLine() to split() the line
rdd0.toDF().show(2)
This works, but there are two problems: it only does a naive split on ",", and I then want to convert it back to a DynamicFrame, for which I need a schema. I tried to get the schema from the catalog, but (you guessed it) I got the same error:
datasourceRaw = glueContext.create_dynamic_frame.from_catalog(database = "glue_db", table_name = "glue_table", transformation_ctx = "datasourceRaw")
print datasourceRaw.schema()
An error occurred while calling o434.schema.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 9.0 failed 4 times, most recent failure: Lost task 1.3 in stage 9.0 (TID 47, ip-172-31-11-73.ec2.internal, executor 5): com.amazonaws.services.glue.util.FatalException: Unable to parse file: xxxx_11142018.csv