Databricks Connect and external libraries

Posted: 2019-11-17 13:03:16

Tags: scala apache-spark databricks azure-databricks

I am using Databricks on Azure, and part of my process involves TwoSigma's Flint. I uploaded the library through Databricks Libraries, and I am able to run the example code below in a notebook in the Databricks workspace.

The problem arises when I try to use databricks-connect. Although everything generally works, whenever I try to use an external library (including Flint), the following code, run under spark-shell --packages 'com.twosigma:flint:0.6.0', produces the error below.

import org.apache.spark.sql.functions._
import com.twosigma.flint.timeseries.TimeSeriesRDD
import scala.concurrent.duration._
import spark.implicits._

// Build a small DataFrame with a timestamp column named "time"
val df = Seq(("2018-08-20", 1.0), ("2018-08-20", 2.0), ("2018-08-21", 3.0))
  .toDF("time", "number")
  .withColumn("time", from_utc_timestamp($"time", "UTC"))

// Convert to a Flint TimeSeriesRDD and group rows sharing the same timestamp
val tsRdd = TimeSeriesRDD.fromDF(df)(isSorted = false, timeUnit = DAYS)
val results = tsRdd.groupByCycle()
results.toDF.show

The error is as follows:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 78.0 failed 4 times, most recent failure: Lost task 2.3 in stage 78.0 (TID 2164, 10.139.64.7, executor 0): java.lang.ClassCastException: org.apache.spark.sql.types.StructField cannot be cast to java.lang.Integer
        at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
        at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:302)
        at com.twosigma.flint.timeseries.TimeSeriesStore$.getInternalRowConverter(TimeSeriesStore.scala:108)
        at com.twosigma.flint.timeseries.TimeSeriesStore$$anonfun$2.apply(TimeSeriesStore.scala:53)
        at com.twosigma.flint.timeseries.TimeSeriesStore$$anonfun$2.apply(TimeSeriesStore.scala:52)
        at org.apache.spark.rdd.RDD$client1f520962c6$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:817)
        at org.apache.spark.rdd.RDD$client1f520962c6$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:817)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
        at org.apache.spark.scheduler.Task.run(Task.scala:112)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Am I specifying the dependency in the wrong way?

1 Answer:

Answer 0 (score: -1):

To specify dependencies, you need to include them as JAR files that are shared across the cluster. From the docs:

Typically, your main class or Python file will have other dependency JARs and files. You can add such dependency JARs and files by calling sparkContext.addJar("path-to-the-jar") or sparkContext.addPyFile("path-to-the-file"). You can also add Egg files and zip files with the addPyFile() interface. Every time you run the code in your IDE, the dependency JARs and files are installed on the cluster.

Here is an example in Scala (also from the docs):

package com.example

import org.apache.spark.sql.SparkSession

case class Foo(x: String)

object Test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      ...
      .getOrCreate();
    spark.sparkContext.setLogLevel("INFO")

    println("Running simple show query...")
    spark.read.parquet("/tmp/x").show()

    println("Running simple UDF query...")

    // Adding external library to project
    spark.sparkContext.addJar("./target/scala-2.11/hello-world_2.11-1.0.jar")
    spark.udf.register("f", (x: Int) => x + 1)
    spark.range(10).selectExpr("f(id)").show()

    println("Running custom objects query...")
    val objs = spark.sparkContext.parallelize(Seq(Foo("bye"), Foo("hi"))).collect()
    println(objs.toSeq)
  }
}
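
Applied to the question's snippet, the same pattern would mean shipping the Flint JAR to the cluster before calling into the library. The sketch below is only an illustration of that idea: the local JAR path is a placeholder, and whether addJar alone resolves the ClassCastException in the question is not confirmed here.

package com.example

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object FlintExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    // Distribute the Flint JAR so the executors have the library at runtime.
    // The path is a placeholder; point it at the Flint JAR on your local machine.
    spark.sparkContext.addJar("/path/to/flint-0.6.0.jar")

    val df = Seq(("2018-08-20", 1.0), ("2018-08-20", 2.0), ("2018-08-21", 3.0))
      .toDF("time", "number")
      .withColumn("time", from_utc_timestamp($"time", "UTC"))

    // With the JAR shipped, the Flint calls from the question run on the cluster.
    import com.twosigma.flint.timeseries.TimeSeriesRDD
    import scala.concurrent.duration._
    val tsRdd = TimeSeriesRDD.fromDF(df)(isSorted = false, timeUnit = DAYS)
    tsRdd.groupByCycle().toDF.show()
  }
}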

It is also worth noting that the Databricks Runtime (DBR) version running on the cluster must match the Databricks Connect version installed on your local machine.