Question

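I'm curious whether there is any significant performance difference between defining a schema with Scala case classes and defining it with Apache Avro for Spark Datasets. Currently my schema looks like this: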

root
 |-- uniqueID: string (nullable = true)
 |-- fieldCount: integer (nullable = false)
 |-- fieldImportance: integer (nullable = false)
 |-- fieldPrimaryName: string (nullable = true)
 |-- fieldSecondaryName: string (nullable = true)
 |-- samples: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- value1: byte (nullable = false)
 |    |    |-- value2: byte (nullable = false)
 |    |    |-- value3: byte (nullable = false)

The corresponding case classes look like this:

case class FieldSample(uniqueID: String,
                       fieldCount: Int,
                       fieldImportance: Int,
                       fieldPrimaryName: String,   // string in the printed schema
                       fieldSecondaryName: String, // string in the printed schema
                       samples: Map[String, ValueStruct])

case class ValueStruct(value1: Byte,
                       value2: Byte,
                       value3: Byte)

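I implemented this with Scala case classes, but I'm seeing a considerable bottleneck when reading from disk. The data is stored on disk in Parquet format. What I'd like to know is whether there is any performance advantage to using an Avro schema instead of Scala case classes in this situation. My guess is that the nested schema is what makes the Parquet reads slow, so I'm wondering whether Avro serialization would offer any performance improvement here. Thanks!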
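For context, here is a minimal sketch of the kind of read being described. It is not the actual job: it assumes a local SparkSession, and the path `/tmp/field_samples.parquet` and the object name `ReadFieldSamples` are hypothetical. It shows where the case class enters the picture (as the Dataset encoder) and how to inspect the physical plan when only the flat columns are selected.

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

// Case classes mirroring the printed schema above.
case class ValueStruct(value1: Byte, value2: Byte, value3: Byte)

case class FieldSample(uniqueID: String,
                       fieldCount: Int,
                       fieldImportance: Int,
                       fieldPrimaryName: String,
                       fieldSecondaryName: String,
                       samples: Map[String, ValueStruct])

object ReadFieldSamples {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode, just for experimenting.
    val spark = SparkSession.builder()
      .appName("read-field-samples")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The schema Spark derives from the case class via its encoder.
    Encoders.product[FieldSample].schema.printTreeString()

    // Hypothetical path; replace with the real Parquet location.
    val path = "/tmp/field_samples.parquet"

    // Typed read: Parquet is decoded into rows, then mapped to FieldSample.
    val ds: Dataset[FieldSample] = spark.read.parquet(path).as[FieldSample]

    // Selecting only flat columns lets one check in the plan whether
    // the nested `samples` map is still being read from disk.
    ds.select($"uniqueID", $"fieldCount").explain()

    spark.stop()
  }
}
```

Comparing the plan and timing of the flat-column select against a full `ds.count()` or `ds.collect()` is one way to test the guess that the nested map is the slow part.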

Avro Schema vs. Scala Case Classes for Spark Datasets

0 answers: