Question

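Since I want to extract data from binary files, I read them with val dataRDD = sc.binaryFiles("Path"). The result is of type org.apache.spark.rdd.RDD[(String, org.apache.spark.input.PortableDataStream)].

I want to extract the content of the files, which I only have in the form of PortableDataStream.

For that I tried val data = dataRDD.map(x => x._2.open()).collect(), but I get the following error: java.io.NotSerializableException: org.apache.hadoop.hdfs.client.HdfsDataInputStream

If you have an idea how I can solve this, please help!

Many thanks.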

Answer 1

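Actually, PortableDataStream is serializable; that is the point of the name. However, open() returns a plain DataInputStream (an HdfsDataInputStream in your case, because your files are on HDFS), which is not serializable, hence the error.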
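
The cause can be reproduced without Spark. The sketch below assumes Java serialization, which is Spark's default for shipping results back to the driver; SerializationCheck and isJavaSerializable are hypothetical names used only for illustration. It shows that a DataInputStream is rejected while a plain byte array is accepted:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, NotSerializableException, ObjectOutputStream}

object SerializationCheck {
  /** Returns true if Java serialization accepts `value`, false if it throws NotSerializableException. */
  def isJavaSerializable(value: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      out.writeObject(value)
      true
    } catch {
      case _: NotSerializableException => false
    } finally {
      out.close()
    }
  }

  def main(args: Array[String]): Unit = {
    val stream = new DataInputStream(new ByteArrayInputStream(Array[Byte](1, 2, 3)))
    println(s"stream serializable: ${isJavaSerializable(stream)}")
    println(s"bytes serializable:  ${isJavaSerializable(Array[Byte](1, 2, 3))}")
  }
}
```

This is why the map should return the data read from the stream (lines, bytes, parsed values) rather than the stream itself.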

In fact, once you open the PortableDataStream, you should read the data right away, inside the same map. In Scala, you can use scala.io.Source.fromInputStream:

val data : RDD[Array[String]] = sc
    .binaryFiles("path/.../")
    .map{ case (fileName, pds) => {
        val in = pds.open()
        // Read all lines eagerly on the executor, then close the stream
        try scala.io.Source.fromInputStream(in).getLines().toArray
        finally in.close()
    }}

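This code assumes the data is text. If it is not, you can adapt it to read any kind of binary data. Here is an example that builds a sequence of bytes, which you can then process however you need.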

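val rdd : RDD[Seq[Byte]] = sc.binaryFiles("...")
    .map{ case (file, pds) => {
        val dis = pds.open()
        val bytes = Array.ofDim[Byte](1024)
        val all = scala.collection.mutable.ArrayBuffer[Byte]()
        // read returns the number of bytes placed in the buffer, or -1 at end of stream
        var n = dis.read(bytes)
        while (n != -1) {
            all ++= bytes.take(n) // append only the bytes actually read
            n = dis.read(bytes)
        }
        dis.close()
        all.toSeq
    }}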
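
One detail of this kind of loop is worth spelling out: read reports how many bytes it actually placed in the buffer, and the last chunk of a file is usually shorter than the buffer. The following is a minimal Spark-free sketch of the pattern; StreamBytes, readAll and chunkSize are hypothetical names, not part of Spark or the JDK.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream}

object StreamBytes {
  /** Drains `in` into a byte array, copying only the bytes each read call delivered. */
  def readAll(in: InputStream, chunkSize: Int = 1024): Array[Byte] = {
    val chunk = new Array[Byte](chunkSize)
    val out = new ByteArrayOutputStream()
    var n = in.read(chunk)        // number of bytes read, or -1 at end of stream
    while (n != -1) {
      out.write(chunk, 0, n)      // ignore the stale tail of the buffer
      n = in.read(chunk)
    }
    out.toByteArray
  }

  def main(args: Array[String]): Unit = {
    // 2500 bytes is deliberately not a multiple of the 1024-byte chunk size
    val payload = Array.tabulate[Byte](2500)(i => (i % 127).toByte)
    val copy = readAll(new ByteArrayInputStream(payload))
    println(s"read ${copy.length} bytes, identical: ${copy.sameElements(payload)}")
  }
}
```

Inside the map, the same helper could presumably be called as StreamBytes.readAll(pds.open()), since DataInputStream is an InputStream.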

For more possibilities, see the Javadoc of DataInputStream. For example, it has readLong, readDouble (and so on) methods.
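
As an illustration of those typed reads, here is a small sketch that runs without Spark. The record layout (one Long followed by one Double) and the names TypedRead, encode and decode are assumptions made up for the example; a real file format would dictate its own layout.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

object TypedRead {
  /** Writes a hypothetical record: an 8-byte Long followed by an 8-byte Double (big-endian). */
  def encode(id: Long, value: Double): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new DataOutputStream(buffer)
    out.writeLong(id)
    out.writeDouble(value)
    out.flush()
    buffer.toByteArray
  }

  /** Reads the record back in the same order it was written. */
  def decode(bytes: Array[Byte]): (Long, Double) = {
    val in = new DataInputStream(new ByteArrayInputStream(bytes))
    try {
      val id = in.readLong()
      val value = in.readDouble()
      (id, value)
    } finally {
      in.close()
    }
  }

  def main(args: Array[String]): Unit = {
    println(decode(encode(42L, 3.5)))
  }
}
```

With Spark, the stream returned by pds.open() would take the place of the DataInputStream built over the byte array here.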

Answer 2

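val bf    = sc.binaryFiles("...")
val bytes = bf.map{ case(file, pds) => {
    val dis = pds.open()
    val len = dis.available() // note: available() is only an estimate of the remaining bytes
    val buf = Array.ofDim[Byte](len)
    dis.readFully(buf)        // reuse the stream opened above instead of opening a second one
    dis.close()
    buf
}}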
bytes: org.apache.spark.rdd.RDD[Array[Byte]] = MapPartitionsRDD[21] at map at <console>:26

scala> bytes.take(1)(0).size
res15: Int = 5879609  // this happened to be the size of my first binary file
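
A possible alternative to sizing the buffer by hand: PortableDataStream has a toArray() method that reads the whole file into a byte array. The sketch below assumes Spark is on the classpath and runs it in local mode against a temporary directory; WholeFileBytes, fileSizes, the app name and the local[1] master are illustrative choices, not part of the answers above.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object WholeFileBytes {
  /** Returns the sorted sizes (in bytes) of all files under `dir`, read through binaryFiles. */
  def fileSizes(dir: String): Array[Int] = {
    // Local master chosen only for the demo; a real job would get its master from spark-submit
    val sc = new SparkContext(new SparkConf().setAppName("pds-demo").setMaster("local[1]"))
    try {
      sc.binaryFiles(dir)
        .map { case (_, pds) => pds.toArray().length } // toArray() reads the whole file on the executor
        .collect()
        .sorted
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    // Two small throwaway files so the example does not depend on any existing path
    val dir = Files.createTempDirectory("pds-demo")
    Files.write(dir.resolve("a.bin"), Array[Byte](1, 2, 3))
    Files.write(dir.resolve("b.bin"), Array[Byte](4, 5, 6, 7, 8))
    println(fileSizes(dir.toString).mkString(", "))
  }
}
```

Since each file ends up as a single in-memory array, this only suits files that fit comfortably in executor memory.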

Scala: How to get the content of PortableDataStream instances from an RDD
