Question

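I have a large table that grows vertically. I want to read rows in small batches so that I can process each one and save the results.

Table definition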

CREATE TABLE foo ( 
uid timeuuid, 
events blob, 
PRIMARY KEY ((uid)) 
)

Code attempt 1 - using CassandraSQLContext

// Step 1. Get uuid of the last row in a batch
val max = 10
val rdd = sc.cassandraTable("foo", "bar")
var cassandraRows = rdd.take(max)
var lastUUID = cassandraRows.last.getUUID("uid"); 
// lastUUID = 131ea620-2e4e-11e4-a2fc-8d5aad979e84


// Step 2. Use last row as a pointer to the start of the next batch
val cc = new CassandraSQLContext(sc)
val cql = s"SELECT events from foo.bar where token(uid) > token($lastUUID) limit $max"

// which is at runtime
// SELECT events from foo.bar WHERE 
// token(uid) > token(131ea620-2e4e-11e4-a2fc-8d5aad979e84) limit 10

cc.sql(cql).collect()

The last line throws

Exception in thread "main" java.lang.RuntimeException: [1.79] failure: ``)'' expected but identifier ea620 found

SELECT events from foo.bar where token(uid) > token(131ea620-2e4e-11e4-a2fc-8d5aad979e84) limit 10
                                                                              ^
    at scala.sys.package$.error(package.scala:27)
    at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
    at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
    at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)

But if I run my CQL in cqlsh, it returns the correct 10 records.

Code attempt 2 - using the DataStax Cassandra connector

// Step 1. Get uuid of the last row in a batch
val max = 10
val rdd = sc.cassandraTable("foo", "bar")
var cassandraRows = rdd.take(max)
var lastUUID = cassandraRows.last.getUUID("uid"); 
// lastUUID = 131ea620-2e4e-11e4-a2fc-8d5aad979e84

// Step 2. Execute query
rdd.where(s"token(uid) > token($lastUUID)").take(max)

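throws

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.io.IOException: Exception during preparation of SELECT "uid", "events" FROM "foo"."bar" WHERE token("uid") > ? AND token("uid") <= ? AND uid > $lastUUID ALLOW FILTERING: line 1:118 no viable alternative at character '$'

How can I use where token(...) queries with Spark and Cassandra?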

Answer 1

I would use the DataStax Cassandra Java Driver. Similar to your CassandraSQLContext, you would select chunks like this:

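import com.datastax.driver.core.querybuilder.QueryBuilder.{fcall, gt, select, token}

// token("uid") renders token(uid); fcall("token", lastUUID) renders token(<value>)
val query = select("events")
  .from("foo", "bar")
  .where(gt(token("uid"), fcall("token", lastUUID)))
  .limit(10)
val rows = session.execute(query).all()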
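
To walk the whole table, the chunk query can be repeated, each time starting after the last uid of the previous chunk. The following is only a sketch of that loop, not connector or driver code: fetchAfter and keyOf are hypothetical parameters standing in for "run the token(uid) > token(?) query" and "read uid from a row", and the demo uses an in-memory list instead of Cassandra. Note that the real query would also have to select uid so the next starting point is known.

```scala
object ChunkedRead {
  // Fetch a batch, process it, then use the key of its last row as the
  // exclusive lower bound for the next fetch. Stops on the first empty batch.
  // fetchAfter(None) is assumed to mean "start from the beginning".
  def foreachBatch[K, R](fetchAfter: Option[K] => Seq[R], keyOf: R => K)(
      process: Seq[R] => Unit): Unit = {
    @annotation.tailrec
    def loop(last: Option[K]): Unit = {
      val batch = fetchAfter(last)
      if (batch.nonEmpty) {
        process(batch)
        loop(Some(keyOf(batch.last)))
      }
    }
    loop(None)
  }

  // In-memory demo: 25 ordered keys read 10 at a time.
  def demoBatchSizes(): List[Int] = {
    val table = (1 to 25).toVector
    val sizes = scala.collection.mutable.ArrayBuffer.empty[Int]
    foreachBatch[Int, Int](
      last => table.filter(k => last.forall(k > _)).take(10),
      identity
    )(batch => sizes += batch.size)
    sizes.toList
  }
}
```

With Cassandra, batches come back in token order rather than uid order, which is why the comparison has to be on token(uid) and not on uid itself.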

If you want to query asynchronously, session also has executeAsync, which in the 2.x/3.x Java driver returns a ResultSetFuture (a Guava ListenableFuture); it can be wrapped in a Scala Future by adding a callback.
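
One possible way to do that wrapping is sketched below. It assumes Guava is on the classpath (the 2.x/3.x driver depends on it); GuavaFutures and toScala are names made up for this example, not part of the driver.

```scala
import java.util.concurrent.{ExecutionException, Executor}
import com.google.common.util.concurrent.ListenableFuture
import scala.concurrent.{Future, Promise}
import scala.util.{Failure, Try}

object GuavaFutures {
  // Runs the listener on whichever thread completes the Guava future.
  private val sameThread: Executor = new Executor {
    override def execute(command: Runnable): Unit = command.run()
  }

  // Completes a Scala Promise once the ListenableFuture has finished.
  def toScala[T](lf: ListenableFuture[T]): Future[T] = {
    val promise = Promise[T]()
    lf.addListener(new Runnable {
      override def run(): Unit = promise.complete(
        Try(lf.get()) match {
          // get() wraps the real failure in an ExecutionException; unwrap it.
          case Failure(e: ExecutionException) if e.getCause != null => Failure(e.getCause)
          case other => other
        }
      )
    }, sameThread)
    promise.future
  }
}
```

Usage would then look like GuavaFutures.toScala(session.executeAsync(query)), after which the usual map/flatMap/onComplete combinators apply.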

Spark: How to read a chunk of a table from Cassandra
