Error with the cassandra-spark-connector repartitionByCassandraReplica function

Date: 2015-04-17 13:36:51

Tags: cassandra apache-spark connector

I'm trying to use the new join functionality from the 1.2 release, but I get an error with the repartitionByCassandraReplica function in the REPL.

I tried to replicate the example from the documentation and created a cassandra table (shopping_history) with a few elements: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md

import com.datastax.spark.connector.rdd._
import com.datastax.spark.connector.cql.CassandraConnector
import com.datastax.spark.connector._
import com.datastax.driver.core._

case class CustomerID(cust_id: Int)
val idsOfInterest = sc.parallelize(1 to 1000).map(CustomerID(_))
val repartitioned =  idsOfInterest.repartitionByCassandraReplica("cim_dev", "shopping_history", 10)
repartitioned.first()

I get this error:

15/04/13 18:35:43 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 2, dev2-cim.aid.fr): java.lang.ClassNotFoundException: $line31.$read$$iwC$$iwC$CustomerID
    at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:344)
    at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
    at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.rdd.RDD$$anonfun$27.apply(RDD.scala:1098)
    at org.apache.spark.rdd.RDD$$anonfun$27.apply(RDD.scala:1098)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

I'm using spark 1.2.0 with connector 1.2.0 RC 3. The joinWithCassandraTable function used on idsOfInterest works.

I'm also curious about the differences between the joinWithCassandraTable / cassandraTable with an IN clause / foreachPartition(withSessionDo) syntaxes.

Do they all request the data from the local node acting as coordinator? Is joinWithCassandraTable combined with repartitionByCassandraReplica as efficient as asynchronous queries that only request data from the local node? What happens if repartitionByCassandraReplica is not applied?

I've already asked this question on the cassandra connector google group: https://groups.google.com/a/lists.datastax.com/forum/#!topic/spark-connector-user/b615ANGSySc

Thanks

1 answer:

Answer 0: (score: 2)

I'll answer your second question here, and I'll follow up on the first part if I can figure out the problem based on more information.

    I'm also curious about the differences between the joinWithCassandraTable / cassandraTable with an IN clause / foreachPartition(withSessionDo) syntaxes.

cassandraTable with an IN clause will create a single spark partition. So it can be fine for very small IN clauses, but the clause has to be serialized from the driver out to the spark application. This can be very bad for large IN clauses, and in general we don't want to send data back and forth between the spark driver and the executors if we don't have to.
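A minimal sketch of that approach (not from the original post), reusing the cim_dev.shopping_history table and cust_id column from the question, and assuming the connector pushes the partition-key IN predicate down as described above:

import com.datastax.spark.connector._

// The whole IN list is built on the driver and embedded in the pushed-down
// predicate, and the result comes back as a single Spark partition.
val ids = (1 to 1000).mkString(", ")
val byInClause = sc.cassandraTable("cim_dev", "shopping_history")
  .where(s"cust_id IN ($ids)")
byInClause.count()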

joinWithCassandraTable and foreachPartition(withSessionDo) are very similar. The main difference is that the joinWithCassandraTable call uses the connector's conversion and reading code, which makes it much easier to get Scala objects out of your Cassandra rows. In both cases your data stays in RDD form and is not serialized back to the driver. They will also both use the partitioner from the previous RDD (or the last RDD that exposes a preferredLocations method), so they can work together with repartitionByCassandraReplica.
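For contrast, here is a rough sketch of both partition-local approaches, assuming the question's keyspace/table and the idsOfInterest RDD of CustomerID; the foreachPartition variant drives the Java driver session directly and handles the raw rows itself:

import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector

// joinWithCassandraTable: the connector issues the per-key reads and converts
// the results into CassandraRow objects (or mapped case classes) for you.
val joined = idsOfInterest.joinWithCassandraTable("cim_dev", "shopping_history")

// foreachPartition(withSessionDo): you manage the session and the raw
// com.datastax.driver.core.Row objects yourself, one Spark partition at a time.
val connector = CassandraConnector(sc.getConf)
idsOfInterest.foreachPartition { part =>
  connector.withSessionDo { session =>
    val stmt = session.prepare("SELECT * FROM cim_dev.shopping_history WHERE cust_id = ?")
    part.foreach { id =>
      val rs = session.execute(stmt.bind(id.cust_id: java.lang.Integer))
      // ... process rs here; nothing is sent back to the driver
    }
  }
}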

If repartitionByCassandraReplica is not applied, the data will be requested from a node that may or may not be a coordinator for the information you are requesting. This adds an extra network hop to your query, but it may not be a very large performance penalty. Whether it's worth repartitioning before the join really depends on the total volume of data being moved and the cost of the spark shuffle in the repartition operation.
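So the locality-aware pattern would look roughly like this (same keyspace/table assumed), repartitioning by replica first so that the subsequent join reads from a local node:

import com.datastax.spark.connector._

val localJoin = idsOfInterest
  .repartitionByCassandraReplica("cim_dev", "shopping_history", 10)
  .joinWithCassandraTable("cim_dev", "shopping_history")

// Without the repartition step the same join still works, but each read may
// land on a node that does not own that key, costing an extra network hop.
val remoteJoin = idsOfInterest.joinWithCassandraTable("cim_dev", "shopping_history")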