Filter a Scala Spark RDD using Cassandra

Asked: 2017-02-08 19:48:13

Tags: scala apache-spark spark-streaming spark-cassandra-connector

I am new to Spark-Cassandra and Scala. I have an existing RDD of the shape:

((url_hash, url, created_timestamp))

I want to filter this RDD on url_hash: if a url_hash already exists in the Cassandra table, I want to drop it from the RDD so that I only process new URLs.

The Cassandra table looks like this:

 url_hash | url | created_timestamp | updated_timestamp

Any pointers would be great.
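Conceptually, what I want can be modeled on plain Scala collections (names and values here are illustrative, not my real data):

```scala
// Model of the desired filtering, using plain collections instead of RDDs.
object FilterNewUrls {
  // Incoming batch keyed by hash: (url_hash, (url, created_timestamp))
  val incoming = Map(
    "h1" -> ("http://a.example", 1486500000L),
    "h2" -> ("http://b.example", 1486500001L)
  )

  // Hashes already stored in the Cassandra table
  val existing = Set("h1")

  // Keep only entries whose hash is NOT already present --
  // the same semantics as PairRDD.subtractByKey.
  val newUrls = incoming.filterNot { case (hash, _) => existing(hash) }
}
```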

I tried something like this:

   import java.util.Date

   case class UrlInfoT(url_sha256: String, full_url: String, created_ts: Date)
   def timestamp = new Date()
   val rdd1 = rdd.map(row => (calcSHA256(row(1)), (row(1), timestamp)))
   val rdd2 = sc.cassandraTable[UrlInfoT]("keyspace", "url_info").select("url_sha256", "full_url", "created_ts")
   val rdd3 = rdd2.map(row => (row.url_sha256, (row.full_url, row.created_ts)))
   val newUrlsRDD = rdd1.subtractByKey(rdd3)

I am getting a Cassandra error:

java.lang.NullPointerException: Unexpected null value of column full_url in keyspace.url_info. If you want to receive null values from Cassandra, please wrap the column type into Option or use JavaBeanColumnMapper

There are no null values in the Cassandra table.

1 Answer:

Answer 0 (score: 1)

Thanks, Archetypal Paul!

I hope someone finds this useful. I had to wrap the fields in Option in the case class.

Still looking forward to a better solution.

import java.util.Date

case class UrlInfoT(url_sha256: String, full_url: Option[String], created_ts: Option[Date])

def timestamp = new Date()
val rdd1 = rdd.map(row => (calcSHA256(row(1)), (row(1), timestamp)))
val rdd2 = sc.cassandraTable[UrlInfoT]("keyspace", "url_info").select("url_sha256", "full_url", "created_ts")
val rdd3 = rdd2.map(row => (row.url_sha256, (row.full_url, row.created_ts)))
val newUrlsRDD = rdd1.subtractByKey(rdd3)
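One consequence of the Option wrapping: any downstream code reading rdd3's values now gets Option fields and has to unwrap them. A minimal sketch with made-up values (plain Scala, no Spark):

```scala
// One record as it would look after the mapping step: the url and
// timestamp are Options because Cassandra may hand back nulls.
object UnwrapExample {
  val record: (String, (Option[String], Option[Long])) =
    ("h1", (Some("http://a.example"), None))

  val (hash, (maybeUrl, maybeTs)) = record

  // getOrElse supplies a fallback when the column was null in Cassandra.
  val url = maybeUrl.getOrElse("")
  val ts  = maybeTs.getOrElse(0L)
}
```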