Flattening a nested ORC file with Spark - Performance issue

Asked: 2017-08-05 11:43:35

Tags: scala apache-spark nested orc

We are facing a severe performance issue when reading a nested ORC file.

This is our ORC schema:

|-- uploader: string (nullable = true)
|-- email: string (nullable = true)
|-- data: array (nullable = true)
|    |-- element: struct (containsNull = true) 
|    |    |-- startTime: string (nullable = true)
|    |    |-- endTime: string (nullable = true)
|    |    |-- val1: string (nullable = true)
|    |    |-- val2: string (nullable = true)
|    |    |-- val3: integer (nullable = true)
|    |    |-- val4: integer (nullable = true)
|    |    |-- val5: integer (nullable = true)
|    |    |-- val6: integer (nullable = true)

The ‘data’ array could potentially contain 75K objects.

In our Spark application, we flatten this ORC file as shown below:

val dataFrame = spark.read.orc(files: _*)
val withData = dataFrame.withColumn("data", explode(dataFrame.col("data")))
val withUploader = withData.select($"uploader", $"data")
val allData = withUploader
  .withColumn("val_1", $"data.val1")
  .withColumn("val_2", $"data.val2")
  .withColumn("val_3", $"data.val3")
  .withColumn("val_4", $"data.val4")
  .withColumn("val_5", $"data.val5")
  .withColumn("val_6", $"data.val6")
  .withColumn("utc_start_time", timestampUdf($"data.startTime"))
  .withColumn("utc_end_time", timestampUdf($"data.endTime"))

val result = allData.drop("data") // drop returns a new DataFrame, so the result has to be captured

The flattening process seems to be a very heavy operation: reading a 2MB ORC file with 20 records, each containing a data array of 75K objects, results in hours of processing time. Reading and collecting the same file without flattening takes 22 seconds.

Is there a way to make Spark process the data faster?

2 Answers:

Answer 0 (score: 2)

I'd try to avoid large explodes completely. With 75K elements in the array:

  • You create 75K Row objects per Row. This is a huge allocation effort.
  • You duplicate uploader and email 75K times. In the short term they will reference the same data, but once the data is serialized and deserialized with the internal format, they will likely point to different objects, effectively multiplying the memory requirements.

Depending on the transformations you want to apply, it may be much more efficient to use a UDF that processes each array as a whole, as sketched below.
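
For illustration, here is a minimal sketch of that idea, assuming you only need a per-record aggregate (the sumVal3 and val3_total names are made up for this example). The whole array is handled inside a single UDF call, so no per-element rows are materialised:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Hypothetical example: sum val3 over the whole data array in one pass,
// instead of exploding it into 75K rows first.
val sumVal3 = udf { (data: Seq[Row]) =>
  if (data == null) 0L
  else data.iterator.map { r =>
    val i = r.fieldIndex("val3")
    if (r.isNullAt(i)) 0L else r.getInt(i).toLong
  }.sum
}

val aggregated = dataFrame.withColumn("val3_total", sumVal3($"data"))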

Answer 1 (score: 0)

In case it helps someone, I found that flattening the data with flatMap is much faster than expanding it with explode:

dataFrame.as[InputFormat].flatMap(r => r.data.map(v => OutputFormat(v, r.tenant)))

The performance improvement was dramatic.

Processing a file with 20 records, each containing an array of 250K rows, took 8 hours with the explode implementation and 7 minutes with the flatMap implementation (!)
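
For reference, here is a minimal sketch of the case classes such a snippet relies on, modelled on the schema from the question rather than on the answerer's real data (their r.tenant presumably corresponds to a top-level field like uploader); all names below are assumptions:

import spark.implicits._ // needed for .as[...] and the flatMap encoders

// Hypothetical case classes mirroring the ORC schema in the question.
case class DataElement(startTime: String, endTime: String,
                       val1: String, val2: String,
                       val3: Option[Int], val4: Option[Int],
                       val5: Option[Int], val6: Option[Int])
case class InputFormat(uploader: String, email: String, data: Seq[DataElement])
case class OutputFormat(element: DataElement, uploader: String)

// Each input record is expanded into one output object per array element,
// using plain Scala collections instead of Spark's explode.
val flattened = dataFrame.as[InputFormat]
  .flatMap(r => r.data.map(v => OutputFormat(v, r.uploader)))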