Reading HDF5 files from a *.tar.gz archive in Scala on Spark

Time: 2016-11-16 09:47:02

Tags: scala apache-spark hdfs hdf5

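Following this post, I am able to read multiple *.txt files that reside inside a *.tar.gz file. Now, however, I need to read HDF5 files inside a *.tar.gz file. A sample file, generated from the million songs dataset, can be downloaded here. Can anyone tell me how to change the following code so that the HDF5 files are read into an RDD? Thanks!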

package a.b.c

import org.apache.spark._
import org.apache.spark.sql.{SQLContext, DataFrame}
import org.apache.spark.ml.tuning.CrossValidatorModel
import org.apache.spark.ml.regression.LinearRegressionModel
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.input.PortableDataStream
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import scala.util.Try
import java.nio.charset._

object Main {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lab1").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    import sqlContext.implicits._
    import sqlContext._

    val inputpath = "path/to/millionsong.tar.gz"
    val rawDF = sc.binaryFiles(inputpath, 2)
                .flatMapValues(x => extractFiles(x).toOption)
                .mapValues(_.map(decode()))
                .map(_._2)
                .flatMap(x => x)
                .flatMap { x => x.split("\n") }
                .toDF()
  }

  def extractFiles(ps: PortableDataStream, n: Int = 1024) = Try {
    val tar = new TarArchiveInputStream(new GzipCompressorInputStream(ps.open))
    Stream.continually(Option(tar.getNextTarEntry))
      // Read until the next entry is null
      .takeWhile(_.isDefined)
      // flatten
      .flatMap(x => x)
      // Drop directories
      .filter(!_.isDirectory)
      .map(e => {
        Stream.continually {
          // Read n bytes
          val buffer = Array.fill[Byte](n)(-1)
          val i = tar.read(buffer, 0, n)
          (i, buffer.take(i))}
        // Take as long as we've read something
        .takeWhile(_._1 > 0)
        .map(_._2)
        .flatten
        .toArray})
      .toArray
  }

  // Decode the bytes with the requested charset (UTF-8 by default)
  def decode(charset: Charset = StandardCharsets.UTF_8)(bytes: Array[Byte]) = new String(bytes, charset)
}

1 Answer:

Answer 0 (score: 0)

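I managed to read the HDF5 files in the tarball by writing the byte stream to a local file and then opening that file as h5, using this extraction function. Here is my code: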

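import java.io.{File, FileInputStream}
import org.apache.commons.compress.archivers.tar.{TarArchiveEntry, TarArchiveInputStream}
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.commons.io.FileUtils
// Assumption: HDF5Factory comes from the JHDF5 library (package ch.systemsx.cisd.hdf5)
import ch.systemsx.cisd.hdf5.HDF5Factory

var tarFiles: Array[String] = Array()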
val tar_path = path + "millionsongsubset.tar.gz"

//TODO: add all your tar.gz files in main folder path to tarFiles array
//should add here as many tar.gz files as wanted containing the
//hdf5 files for the songs
tarFiles = tarFiles :+ tar_path
//tarFiles = tarFiles :+ (path+"A.tar.gz")
//tarFiles = tarFiles :+ (path+"B.tar.gz")
//tarFiles = tarFiles :+ (path+"C.tar.gz")

//This reads all tar.gz files in tarFiles list, and for each .h5
//file within, it extracts each song's list of features.
//Thus, it gets a list of features for all songs in the files.
var allHDF5 = sc.parallelize(tarFiles).flatMap(path => { 
    val tar = new TarArchiveInputStream(new GzipCompressorInputStream(new FileInputStream(path)))
    var entry: TarArchiveEntry = tar.getNextEntry().asInstanceOf[TarArchiveEntry]
    var res: List[Array[Byte]] = List()
    var i = 0
    while (entry != null) {
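        // Keep only HDF5 entries; directories and other files are skipped
        if (!entry.isDirectory() && entry.getName.endsWith(".h5")) {
            val byteFile = Array.ofDim[Byte](entry.getSize.toInt)
            // A single read() may return fewer bytes than requested, so read until the buffer is full
            org.apache.commons.compress.utils.IOUtils.readFully(tar, byteFile)
            res = byteFile :: res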
            if(i%100==0) {
              println("Read " + i + " files")
            }
            i = i+1

        }
        entry = tar.getNextEntry().asInstanceOf[TarArchiveEntry]
    }
    // Release the underlying file handle once every entry has been visited
    tar.close()
    //All files are turned into byte arrays
    res

  } ).map(bytes => {
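    // Array.toString only yields an identity hash such as "[B@1b2c3d", which can collide,
    // so a random UUID is used as the temporary file name instead
     val name = java.util.UUID.randomUUID().toString + ".h5"
     val file = new File(name)
     FileUtils.writeByteArrayToFile(file, bytes)
     val reader = HDF5Factory.openForReading(name)
     val features = getFeatures(reader)
     reader.close()
     // Remove the temporary copy once the features are extracted
     file.delete()
     features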
  })

  println("Extracted songs from tar.gz, showing 5 examples")
  allHDF5.take(5).foreach(x => { x.foreach(y => print(y+" "))
                       println()})
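
The extraction loop above fills a byte array sized to the tar entry. Since `InputStream.read` is allowed to return fewer bytes than requested, a defensive version reads in a loop until the buffer is full. The following is a minimal sketch of that idea using only the JDK; `ReadFullyExample`, `readFully` and `TrickleStream` are hypothetical names introduced here for illustration and are not part of the answer's code.

```scala
import java.io.{ByteArrayInputStream, EOFException, InputStream}

object ReadFullyExample {
  /** Reads exactly `size` bytes from `in`, looping because a single
    * `read` call may deliver fewer bytes than were asked for. */
  def readFully(in: InputStream, size: Int): Array[Byte] = {
    val buffer = new Array[Byte](size)
    var offset = 0
    while (offset < size) {
      val n = in.read(buffer, offset, size - offset)
      if (n < 0) throw new EOFException(s"Stream ended after $offset of $size bytes")
      offset += n
    }
    buffer
  }

  /** Hypothetical test stream that hands out at most 3 bytes per call,
    * mimicking a compressed stream that returns short reads. */
  class TrickleStream(data: Array[Byte]) extends ByteArrayInputStream(data) {
    override def read(b: Array[Byte], off: Int, len: Int): Int =
      super.read(b, off, math.min(len, 3))
  }
}
```

With a stream like `TrickleStream`, one `read` call fills only part of the buffer, while `readFully` returns the complete content. In the Spark code, the tar input stream would take the place of `in` and `entry.getSize.toInt` the place of `size`.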

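A few remarks:

  1. The getFeatures method: this is a very simple modification of the code found here; it extracts a few features and returns them as an array. Note that to run this feature-extraction code you need this library, which comes with good javadoc.
  2. Note that if this code runs on a cluster with multiple executors, each executor writes the .h5 files to its own local disk. If tasks are moved between nodes, a task may at some point try to read a file that does not exist locally.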
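
One way to reduce the risk described in the second remark is to keep the write, the read and the cleanup inside the same function call, so the file is only ever needed on the machine that created it. The sketch below shows this with a small loan-pattern helper built on the JDK only; `TempFileExample` and `withTempFile` are hypothetical names, and the commented usage assumes the answer's own `getFeatures` helper and `HDF5Factory`.

```scala
import java.nio.file.{Files, Path}

object TempFileExample {
  /** Writes `bytes` to a fresh temporary file, applies `f` to its path,
    * and always deletes the file afterwards, even if `f` throws. */
  def withTempFile[A](bytes: Array[Byte], suffix: String = ".h5")(f: Path => A): A = {
    // createTempFile picks a unique name in the local temp directory
    val tmp = Files.createTempFile("song-", suffix)
    try {
      Files.write(tmp, bytes)
      f(tmp)
    } finally {
      Files.deleteIfExists(tmp)
    }
  }

  // Possible use inside the Spark map step (assumes the answer's getFeatures):
  //   .map(bytes => withTempFile(bytes) { p =>
  //     val reader = HDF5Factory.openForReading(p.toString)
  //     try getFeatures(reader) finally reader.close()
  //   })
}
```

Because the temporary file is created and deleted within a single task, no other executor ever has to find it, and nothing is left behind on local disk.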