I'm working with Databricks on Azure, and part of my process involves using TwoSigma's Flint. I uploaded the library as a Databricks library and can run the example code below in a notebook in the Databricks workspace. The problem arises when I try to use databricks-connect. Everything else generally works, but when I try to use an external library such as Flint, the following code, run under spark-shell --packages 'com.twosigma:flint:0.6.0', produces the error below.
import org.apache.spark.sql.functions._
import com.twosigma.flint.timeseries.TimeSeriesRDD
import scala.concurrent.duration._
import spark.implicits._
val df = Seq(("2018-08-20", 1.0), ("2018-08-20", 2.0), ("2018-08-21", 3.0)).toDF("time", "number").withColumn("time", from_utc_timestamp($"time", "UTC"))
val tsRdd = TimeSeriesRDD.fromDF(df)(isSorted=false, timeUnit=DAYS)
val results = tsRdd.groupByCycle()
results.toDF.show
The error is as follows:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 78.0 failed 4 times, most recent failure: Lost task 2.3 in stage 78.0 (TID 2164, 10.139.64.7, executor 0): java.lang.ClassCastException: org.apache.spark.sql.types.StructField cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:302)
at com.twosigma.flint.timeseries.TimeSeriesStore$.getInternalRowConverter(TimeSeriesStore.scala:108)
at com.twosigma.flint.timeseries.TimeSeriesStore$$anonfun$2.apply(TimeSeriesStore.scala:53)
at com.twosigma.flint.timeseries.TimeSeriesStore$$anonfun$2.apply(TimeSeriesStore.scala:52)
at org.apache.spark.rdd.RDD$client1f520962c6$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:817)
at org.apache.spark.rdd.RDD$client1f520962c6$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:817)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
at org.apache.spark.scheduler.Task.run(Task.scala:112)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Am I specifying the dependency the wrong way?
Answer 0 (score: -1)
To specify dependencies, you need to include them as JAR files that are shared across the cluster. From the docs:

Typically your main class or Python file will have other dependency JARs and files. You can add such dependency JARs and files by calling sparkContext.addJar("path-to-the-jar") or sparkContext.addPyFile("path-to-the-file"). You can also add Egg files and zip files with the addPyFile() interface. Every time you run the code in your IDE, the dependency JARs and files are installed on the cluster.
Here is an example in Scala (also from the docs):
package com.example

import org.apache.spark.sql.SparkSession

case class Foo(x: String)

object Test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      ...
      .getOrCreate();
    spark.sparkContext.setLogLevel("INFO")

    println("Running simple show query...")
    spark.read.parquet("/tmp/x").show()

    println("Running simple UDF query...")

    // Adding external library to project
    spark.sparkContext.addJar("./target/scala-2.11/hello-world_2.11-1.0.jar")
    spark.udf.register("f", (x: Int) => x + 1)
    spark.range(10).selectExpr("f(id)").show()

    println("Running custom objects query...")
    val objs = spark.sparkContext.parallelize(Seq(Foo("bye"), Foo("hi"))).collect()
    println(objs.toSeq)
  }
}
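Applied to this question, the same approach would mean registering the Flint JAR with the SparkContext that databricks-connect creates, before building the TimeSeriesRDD. Below is a minimal sketch: the JAR path and the object name FlintExample are placeholders, and it assumes the Flint 0.6.0 artifact is available on your local machine (for example, downloaded during the --packages resolution).

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import com.twosigma.flint.timeseries.TimeSeriesRDD
import scala.concurrent.duration._

object FlintExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    // Ship the Flint JAR to the cluster so the executors can load Flint classes.
    // The path below is a placeholder; point it at the actual flint-0.6.0 JAR on your machine.
    spark.sparkContext.addJar("/path/to/flint-0.6.0.jar")

    // Same example as in the question.
    val df = Seq(("2018-08-20", 1.0), ("2018-08-20", 2.0), ("2018-08-21", 3.0))
      .toDF("time", "number")
      .withColumn("time", from_utc_timestamp($"time", "UTC"))

    val tsRdd = TimeSeriesRDD.fromDF(df)(isSorted = false, timeUnit = DAYS)
    tsRdd.groupByCycle().toDF.show()
  }
}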
It is also worth noting that the Databricks Runtime (DBR) version running on the cluster and the databricks-connect version running on your local machine must match.