How to read multiple Excel files and concatenate them into one Apache Spark DataFrame?

Time: 2017-03-12 14:38:39

Tags: excel scala apache-spark apache-spark-dataset

Recently I wanted to work through the Spark machine learning lab from Spark Summit 2016. The training video is here, and the exported notebook is available here.

The dataset used in the lab can be downloaded from the UCI Machine Learning Repository. It contains a set of readings from various sensors in a gas-fired power generation plant. The format is an xlsx file with five sheets.

To use the data in the lab, I needed to read all the sheets from the Excel file and concatenate them into one Spark DataFrame. During the training they were using a Databricks notebook, but I am using IntelliJ IDEA with Scala and evaluating the code in the console.

The first step was to save each Excel sheet as a separate xlsx file, named sheet1.xlsx, sheet2.xlsx, and so on, and to put them all into a sheets directory.

How can I read all the Excel files and concatenate them into one Apache Spark DataFrame?

3 Answers:

Answer 0 (score: 3)

For this I used the spark-excel package. It can be added to the build.sbt file as:

libraryDependencies += "com.crealytics" %% "spark-excel" % "0.8.2"
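
When experimenting in a plain spark-shell rather than an sbt project, the same artifact can be pulled in with the --packages flag (a sketch, not from the original answer; the coordinates assume the Scala 2.11 build of the library):

spark-shell --packages com.crealytics:spark-excel_2.11:0.8.2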

The code executed in the IntelliJ IDEA Scala console was (results shown inline as comments):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SparkSession, DataFrame}
import java.io.File

val conf = new SparkConf().setAppName("Excel to DataFrame").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")

val spark = SparkSession.builder().getOrCreate()

// Function to read xlsx file using spark-excel. 
// This code format with "trailing dots" can be sent to IJ Scala Console as a block.
def readExcel(file: String): DataFrame = spark.read.
  format("com.crealytics.spark.excel").
  option("location", file).
  option("useHeader", "true").
  option("treatEmptyValuesAsNulls", "true").
  option("inferSchema", "true").
  option("addColorColumns", "False").
  load()

val dir = new File("./data/CCPP/sheets")
val excelFiles = dir.listFiles.sorted.map(f => f.toString)  // Array[String]

val dfs = excelFiles.map(f => readExcel(f))  // Array[DataFrame]
val ppdf = dfs.reduce(_.union(_))  // DataFrame 

ppdf.count()  // res3: Long = 47840
ppdf.show(5)
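
Note that union matches columns by position, not by name, so all sheets must share the same layout. A minimal sanity check that could be run before the reduce (a sketch, not part of the original answer):

// All five sheets should produce an identical schema before they are unioned.
val schemas = dfs.map(_.schema).distinct
require(schemas.length == 1, s"Expected one schema, found ${schemas.length}")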

Answer 1 (score: 0)

Hopefully this Spark Scala code helps.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path, FileSystem}
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex
import java.net.URI

// List all leaf files under `basep` that match the glob `globp`.
// (Uses Spark-internal APIs whose signatures may differ between Spark versions.)
def listFiles(basep: String, globp: String): Seq[String] = {
  val conf = new Configuration(sc.hadoopConfiguration)
  val fs = FileSystem.get(new URI(basep), conf)

  def validated(path: String): Path = {
    if(path startsWith "/") new Path(path)
    else new Path("/" + path)
  }

  val fileCatalog = InMemoryFileIndex.bulkListLeafFiles(
    paths = SparkHadoopUtil.get.globPath(fs, Path.mergePaths(validated(basep), validated(globp))),
    hadoopConf = conf,
    filter = null,
    sparkSession = spark)

  fileCatalog.flatMap(_._2.map(_.path))
}

val root = "/mnt/{path to your file directory}"
val globp = "[^_]*"

val files = listFiles(root, globp)
val paths = files.toVector

Loop over the vector to read the multiple files:

for (path <- paths) {
  print(path.toString)

  val df = spark.read.
    format("com.crealytics.spark.excel").
    option("useHeader", "true").
    option("treatEmptyValuesAsNulls", "false").
    option("inferSchema", "false").
    option("addColorColumns", "false").
    load(path.toString)
}
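
As written, the loop only reads each file and discards the resulting DataFrame. To actually concatenate them into a single DataFrame, the same map-and-union pattern as in the first answer can be applied (a sketch, not part of the original answer, assuming all files share the same column layout):

val dfs = paths.map { p =>
  spark.read.
    format("com.crealytics.spark.excel").
    option("useHeader", "true").
    option("treatEmptyValuesAsNulls", "false").
    option("inferSchema", "false").
    option("addColorColumns", "false").
    load(p)
}
val combined = dfs.reduce(_.union(_))  // one DataFrame with all rows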

Answer 2 (score: -1)

We need the spark-excel library, which can be obtained from

https://github.com/crealytics/spark-excel#scala-api

  1. Clone the git project from the GitHub link above and build it with "sbt package".
  2. Run spark-shell with Spark 2:

     spark-shell --driver-class-path ./spark-excel_2.11-0.8.3.jar --master=yarn-client

  3. Import the necessary classes:

     import org.apache.spark.sql._
     import org.apache.spark.sql.functions._
     val sqlContext = new SQLContext(sc)

  4. Set the path to the Excel document:

     val document = "path to excel doc"

  5. Execute the following to create a DataFrame:

     val dataDF = sqlContext.read
       .format("com.crealytics.spark.excel")
       .option("sheetName", "Sheet Name")
       .option("useHeader", "true")
       .option("treatEmptyValuesAsNulls", "false")
       .option("inferSchema", "false")
       .option("location", document)
       .option("addColorColumns", "false")
       .load(document)

That's all! You can now perform DataFrame operations on the dataDF object.
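
For instance, a few follow-up operations on dataDF (a minimal usage sketch, not part of the original answer):

dataDF.printSchema()  // inspect the column names and types
dataDF.show(5)        // preview the first five rows
dataDF.count()        // number of rows read from the sheet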