Spark with Scala: writing a null-like field value instead of a TupleValue in Cassandra

Asked: 2016-12-27 14:06:47

Tags: scala apache-spark cassandra databricks

In one of my tables, suppose I have the following field:

f: frozen<tuple<text, set<text>>>

Suppose I want to use a Scala script to insert entries where this particular field is null, empty, absent, etc., and that before inserting I map each entry's fields like so:

sRow("fk") = null // or None, or maybe I simply don't specify the field at all

When I try to run the Spark script (from Databricks, with version 1.6 of the Spark connector), I get the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 133.0 failed 1 times, most recent failure: Lost task 6.0 in stage 133.0 (TID 447, localhost): com.datastax.spark.connector.types.TypeConversionException: Cannot convert object null to com.datastax.spark.connector.TupleValue.
    at com.datastax.spark.connector.types.TypeConverter$$anonfun$convert$1.apply(TypeConverter.scala:47)
    at com.datastax.spark.connector.types.TypeConverter$$anonfun$convert$1.apply(TypeConverter.scala:43)

When I use None instead of null, I still get an error, but a different one:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 143.0 failed 1 times, most recent failure: Lost task 2.0 in stage 143.0 (TID 474, localhost): java.lang.IllegalArgumentException: requirement failed: Expected 2 components, instead of 0
    at scala.Predef$.require(Predef.scala:233)
    at com.datastax.spark.connector.types.TupleType.newInstance(TupleType.scala:55)

I understand that Cassandra has no exact notion of null, but I know there is a way to leave the value unset when inserting an entry into Cassandra, since I have done this from other environments, e.g. with the Node.js driver for Cassandra. How can I force a null-like value on insert where a TupleValue or some user-defined type is expected?

1 Answer:

Answer 0 (score: 0)

With modern versions of Cassandra, you can use the "Unset" feature to make it actually skip null values. This is probably the best option for your use case, since explicitly writing a null implicitly writes a tombstone.

See Treating nulls as Unset.

import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.WriteConf

// Set up original data: (1, 1, 1) --> (6, 6, 6)
sc.parallelize(1 to 6).map(x => (x, x, x)).saveToCassandra(ks, "tab1")

val ignoreNullsWriteConf = WriteConf.fromSparkConf(sc.getConf).copy(ignoreNulls = true)
// These writes will not delete anything because we are ignoring nulls
sc.parallelize(1 to 6)
  .map(x => (x, None, None))
  .saveToCassandra(ks, "tab1", writeConf = ignoreNullsWriteConf)

val results = sc.cassandraTable[(Int, Int, Int)](ks, "tab1").collect

results
/**
  (1, 1, 1),
  (2, 2, 2),
  (3, 3, 3),
  (4, 4, 4),
  (5, 5, 5),
  (6, 6, 6)
**/

There are also finer-grained controls; see the Full Docs.
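For per-value control rather than a whole-write setting, the connector (from version 1.6) also offers a `CassandraOption` type. This is a sketch under the same assumed session (`sc`, `ks`, `"tab1"` as above), not part of the original answer:

```scala
import com.datastax.spark.connector._
import com.datastax.spark.connector.types.CassandraOption

// Each field decides its own behavior:
//   CassandraOption.Unset    -> skip this column entirely (no tombstone)
//   CassandraOption.Null     -> explicitly write null (writes a tombstone)
//   CassandraOption.Value(x) -> write x
sc.parallelize(1 to 6)
  .map(x => (x, CassandraOption.Unset, CassandraOption.Value(x)))
  .saveToCassandra(ks, "tab1")
```

This way a single RDD can mix rows that overwrite a column with rows that leave it untouched, which is useful when only some entries should omit the tuple field.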