Is joinWithCassandraTable() lazy?

Asked: 2016-01-18 15:29:01

Tags: apache-spark cassandra spark-cassandra-connector

I am using Spark 1.2.1 with the spark-cassandra-connector:

//join with cassandra
val rdd = some_array.map(x => SomeClass(x._1,x._2)).joinWithCassandraTable(keyspace, some_table)
println(timer, "Join")

//get only the jsons and create rdd temp table
val jsons = rdd.map(_._2.getString("this"))
val jsonSchemaRDD = sqlContext.jsonRDD(jsons)
jsonSchemaRDD.registerTempTable("this_json")
println(timer, "Map")

The output is:

Timer "Join"- 558 ms
Timer "Map"- 290284 ms

My guess is that joinWithCassandraTable() is lazy. If so, what triggers its evaluation?

1 Answer:

Answer 0: (score: 4)

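Actually, the part that triggers evaluation is sqlContext.jsonRDD. Since you do not provide a schema, it has to materialize jsons in order to infer one, and materializing jsons means running the join and the map it depends on. That is why the time shows up under the "Map" timer.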

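joinWithCassandraTable is somewhat similar in that it is not entirely free either: it has to connect to Cassandra and fetch the required metadata, which would account for the few hundred milliseconds under "Join". See "Apache Spark: Driver (instead of just the Executors) tries to connect to Cassandra".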
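
If the structure of the JSON is known up front, one way to avoid the inference pass is to hand sqlContext.jsonRDD a schema explicitly. The following is only a sketch under assumptions: it targets the Spark 1.2.x API (where the SQL data types are exposed from org.apache.spark.sql), and the object name `ExplicitSchemaExample`, the helper `register`, and the single string column `payload` are hypothetical placeholders, not taken from the question.

```scala
import org.apache.spark.rdd.RDD
// Spark 1.2.x: SQLContext, SchemaRDD and the data types are exposed from this package.
import org.apache.spark.sql._

object ExplicitSchemaExample {
  // Hypothetical schema: a single nullable string column named "payload".
  // Replace it with the real structure of the JSON documents.
  val schema: StructType =
    StructType(Seq(StructField("payload", StringType, nullable = true)))

  // Registers the temp table using the supplied schema, so jsonRDD
  // should not need to scan `jsons` just to infer one.
  def register(sqlContext: SQLContext, jsons: RDD[String]): SchemaRDD = {
    val table = sqlContext.jsonRDD(jsons, schema)
    table.registerTempTable("this_json")
    table
  }
}
```

With a schema supplied, the work should be deferred until the first action on the table, for example `sqlContext.sql("SELECT COUNT(*) FROM this_json").collect()`. The join and the map still have to run at that point, so the total cost does not disappear; it only moves to where the data is actually consumed.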