I'm trying to read a large CSV dataset into PySpark. Some of the data is malformed and does not match the schema I specified. I'm trying to drop these malformed rows from the dataframe:
df_raw = spark.read \
    .format("org.apache.spark.csv") \
    .option("header", "true") \
    .option("quote", '"') \
    .option("mode", "DROPMALFORMED") \
    .schema(df_schema) \
    .csv(input_file)
However, whenever it encounters a malformed line, it appears to kill the executor task:
16/11/04 11:24:47 WARN CSVRelation: Dropping malformed line:
888800017810876000, 10.61,D,10792516955,,,aa999,,"19 Y1U ""R""",EO,,
"10 Y1U ""R"" AI2, YT IA", XXXXXXXXYYYYYYYYYZZZZZZZZZ,
63.0, going great, 2016-05-17,436,2016-05-17,SOMECODE
16/11/04 11:28:14 ERROR Utils: Uncaught exception in thread stdout writer for python
java.net.SocketException: socket already closed
It also seems to cause memory usage to spike. Can anyone explain what is happening and suggest a possible workaround?
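
For reference, here is a sketch of the PERMISSIVE-mode alternative I may fall back to, in case it changes the diagnosis. It assumes my Spark version supports the columnNameOfCorruptRecord option for CSV (documented in later 2.x releases); df_schema, input_file, and spark are the same as above:

from pyspark.sql.functions import col
from pyspark.sql.types import StructField, StringType

# Same df_schema as above, plus a string column that PERMISSIVE mode
# can use to capture the raw text of any malformed line.
permissive_schema = df_schema.add(StructField("_corrupt_record", StringType(), True))

df_raw = spark.read \
    .option("header", "true") \
    .option("quote", '"') \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .schema(permissive_schema) \
    .csv(input_file)

# Cache before filtering: some Spark versions reject queries that
# reference only the corrupt-record column of an un-materialized frame.
df_raw.cache()
df_clean = df_raw.filter(col("_corrupt_record").isNull()).drop("_corrupt_record")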