我是 Scala、Spark 的新手,所以我在尝试创建一个地图函数时苦苦挣扎。 Dataframe a Row (org.apache.spark.sql.Row) 上的 map 函数 我一直在关注 this 篇文章。
val rddWithExceptionHandling = filterValueDF.rdd.map { row: Row =>
val parsed = Try(from_avro(???, currentValueSchema.value, fromAvroOptions)) match {
case Success(parsedValue) => List(parsedValue, null)
case Failure(ex) => List(null, ex.toString)
}
Row.fromSeq(row.toSeq.toList ++ parsed)
}
from_avro
函数想要接受一个列 (org.apache.spark.sql.Column),但是我在文档中没有看到从行中获取列的方法。
我完全接受我可能做错了整件事的想法。 最终我的目标是解析来自 Structure Stream 的字节。 解析的记录写入Delta Table A,失败的记录写入另一个Delta Table B
对于上下文,源表如下所示:
编辑 - from_avro
在“错误记录”上返回 null
有一些评论说,如果 from_avro
无法解析“坏记录”,则返回 null。默认情况下,from_avro
使用模式 FAILFAST
,如果解析失败,它将抛出异常。如果将模式设置为 PERMISSIVE
,则返回模式形状的对象,但所有属性都为 null(也不是特别有用......)。链接到 Apache Avro Data Source Guide - Spark 3.1.1 Documentation
这是我的原始命令:
val parsedDf = filterValueDF.select($"topic",
$"partition",
$"offset",
$"timestamp",
$"timestampType",
$"valueSchemaId",
from_avro($"fixedValue", currentValueSchema.value, fromAvroOptions).as('parsedValue))
如果有任何错误的行,作业将被 org.apache.spark.SparkException: Job aborted.
异常日志片段:
Caused by: org.apache.spark.SparkException: Malformed records are detected in record parsing. Current parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
at org.apache.spark.sql.avro.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:111)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:732)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$2(FileFormatWriter.scala:291)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1615)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:300)
... 10 more
Suppressed: java.lang.NullPointerException
at shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem$NativeAzureFsOutputStream.write(NativeAzureFileSystem.java:1099)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.write(HadoopPositionOutputStream.java:50)
at shaded.parquet.org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145)
at shaded.parquet.org.apache.thrift.transport.TTransport.write(TTransport.java:107)
at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:482)
at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:489)
at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeFieldBeginInternal(TCompactProtocol.java:252)
at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeFieldBegin(TCompactProtocol.java:234)
at org.apache.parquet.format.InterningProtocol.writeFieldBegin(InterningProtocol.java:74)
at org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.write(FileMetaData.java:1184)
at org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.write(FileMetaData.java:1051)
at org.apache.parquet.format.FileMetaData.write(FileMetaData.java:949)
at org.apache.parquet.format.Util.write(Util.java:222)
at org.apache.parquet.format.Util.writeFileMetaData(Util.java:69)
at org.apache.parquet.hadoop.ParquetFileWriter.serializeFooter(ParquetFileWriter.java:757)
at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:750)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:135)
at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:58)
at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.abort(FileFormatDataWriter.scala:84)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$3(FileFormatWriter.scala:297)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1626)
... 11 more
Caused by: java.lang.ArithmeticException: Unscaled value too large for precision
at org.apache.spark.sql.types.Decimal.set(Decimal.scala:83)
at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:577)
at org.apache.spark.sql.avro.AvroDeserializer.createDecimal(AvroDeserializer.scala:308)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$16(AvroDeserializer.scala:177)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$16$adapted(AvroDeserializer.scala:174)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1(AvroDeserializer.scala:336)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1$adapted(AvroDeserializer.scala:332)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2(AvroDeserializer.scala:354)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2$adapted(AvroDeserializer.scala:351)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$converter$3(AvroDeserializer.scala:75)
at org.apache.spark.sql.avro.AvroDeserializer.deserialize(AvroDeserializer.scala:89)
at org.apache.spark.sql.avro.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:101)
... 16 more
答案 0 :(得分:1)
为了从 Row 对象中获取特定列,您可以使用 row.get(i)
或将列名与 row.getAs[T]("columnName")
一起使用。 Here 可以查看 Row 类的详细信息。
然后您的代码将如下所示:
val rddWithExceptionHandling = filterValueDF.rdd.map { row: Row =>
val binaryFixedValue = row.getSeq[Byte](6) // or row.getAs[Seq[Byte]]("fixedValue")
val parsed = Try(from_avro(binaryFixedValue, currentValueSchema.value, fromAvroOptions)) match {
case Success(parsedValue) => List(parsedValue, null)
case Failure(ex) => List(null, ex.toString)
}
Row.fromSeq(row.toSeq.toList ++ parsed)
}
尽管在您的情况下,您实际上并不需要进入 map 函数,因为当 from_avro
与 Dataframe API 一起使用时,您必须使用原始 Scala 类型。这就是您不能直接从 from_avro
调用 map
的原因,因为 Column
类的实例可以仅用于结合Dataframe API,即:df.select($"c1")
,这里c1是Column的一个实例。要按照您最初的意图使用 from_avro
,只需键入:
filterValueDF.select(from_avro($"fixedValue", currentValueSchema))
正如@mike 已经提到的,如果 from_avro
解析失败,AVRO 内容将返回 null。最后,如果您想将成功的行与失败的行分开,您可以执行以下操作:
val includingFailuresDf = filterValueDF.select(
from_avro($"fixedValue", currentValueSchema) as "avro_res")
.withColumn("failed", $"avro_res".isNull)
val successDf = includingFailuresDf.where($"failed" === false)
val failedDf = includingFailuresDf.where($"failed" === true)
请注意,代码未经测试。
答案 1 :(得分:0)
据我所知,您只需要为一行获取一列。您可以通过使用 row.get()
在特定索引处获取列值来做到这一点