I'm trying to build a log-processing application with Scala (2.11), Spark Streaming (1.5.0), and Cassandra (3.5). Currently, when the first batch of RDD items arrives and foreachRDD(...) runs,
the saveAsCassandraTable() method correctly creates the required table schema in Cassandra, but no RDD entries are inserted into the table.
logitems.foreachRDD(items => {
  if (items.count() == 0)
    println("No log item received")
  else {
    val first = items.first()
    println(first.timestamp) // WORKS: shows the timestamp of the first RDD element
    items.saveAsCassandraTable("analytics", "test_logs",
      SomeColumns("timestamp", "c_ip", "c_referrer", "c_user_agent"))
    // table schema is created but the RDD items are not written
  }
})
16/05/19 16:15:06 INFO Cluster: New Cassandra host /192.168.1.95:9042 added
16/05/19 16:15:06 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
16/05/19 16:15:07 INFO SparkContext: Starting job: foreachRDD at StreamingApp.scala:27
16/05/19 16:15:07 INFO DAGScheduler: Got job 8 (foreachRDD at StreamingApp.scala:27) with 8 output partitions
16/05/19 16:15:07 INFO DAGScheduler: Final stage: ResultStage 6 (foreachRDD at StreamingApp.scala:27)
16/05/19 16:15:07 INFO DAGScheduler: Parents of final stage: List()
16/05/19 16:15:07 INFO DAGScheduler: Missing parents: List()
16/05/19 16:15:07 INFO DAGScheduler: Submitting ResultStage 6 (MapPartitionsRDD[9] at map at StreamingApp.scala:24), which has no missing parents
16/05/19 16:15:07 INFO MemoryStore: ensureFreeSpace(13272) called with curMem=122031, maxMem=1538166620
16/05/19 16:15:07 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 13.0 KB, free 1466.8 MB)
16/05/19 16:15:07 INFO MemoryStore: ensureFreeSpace(5909) called with curMem=135303, maxMem=1538166620
16/05/19 16:15:07 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 5.8 KB, free 1466.8 MB)
16/05/19 16:15:07 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:63323 (size: 5.8 KB, free: 1466.8 MB)
16/05/19 16:15:07 INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:861
16/05/19 16:15:07 INFO DAGScheduler: Submitting 8 missing tasks from ResultStage 6 (MapPartitionsRDD[9] at map at StreamingApp.scala:24)
16/05/19 16:15:07 INFO TaskSchedulerImpl: Adding task set 6.0 with 8 tasks
16/05/19 16:15:07 INFO TaskSetManager: Starting task 0.0 in stage 6.0 (TID 34, localhost, PROCESS_LOCAL, 1943 bytes)
16/05/19 16:15:07 INFO TaskSetManager: Starting task 1.0 in stage 6.0 (TID 35, localhost, PROCESS_LOCAL, 1943 bytes)
16/05/19 16:15:07 INFO TaskSetManager: Starting task 2.0 in stage 6.0 (TID 36, localhost, PROCESS_LOCAL, 1943 bytes)
16/05/19 16:15:07 INFO TaskSetManager: Starting task 3.0 in stage 6.0 (TID 37, localhost, PROCESS_LOCAL, 1943 bytes)
16/05/19 16:15:07 INFO TaskSetManager: Starting task 4.0 in stage 6.0 (TID 38, localhost, PROCESS_LOCAL, 1943 bytes)
16/05/19 16:15:07 INFO TaskSetManager: Starting task 5.0 in stage 6.0 (TID 39, localhost, PROCESS_LOCAL, 1943 bytes)
16/05/19 16:15:07 INFO TaskSetManager: Starting task 6.0 in stage 6.0 (TID 40, localhost, PROCESS_LOCAL, 1943 bytes)
16/05/19 16:15:07 INFO Executor: Running task 0.0 in stage 6.0 (TID 34)
16/05/19 16:15:07 INFO Executor: Running task 1.0 in stage 6.0 (TID 35)
16/05/19 16:15:07 INFO Executor: Running task 4.0 in stage 6.0 (TID 38)
16/05/19 16:15:07 INFO Executor: Running task 5.0 in stage 6.0 (TID 39)
16/05/19 16:15:07 INFO Executor: Running task 3.0 in stage 6.0 (TID 37)
16/05/19 16:15:07 INFO Executor: Running task 2.0 in stage 6.0 (TID 36)
16/05/19 16:15:07 INFO Executor: Running task 6.0 in stage 6.0 (TID 40)
16/05/19 16:15:07 INFO BlockManager: Found block rdd_9_6 locally
16/05/19 16:15:07 INFO BlockManager: Found block rdd_9_3 locally
16/05/19 16:15:07 INFO BlockManager: Found block rdd_9_4 locally
16/05/19 16:15:07 INFO BlockManager: Found block rdd_9_2 locally
16/05/19 16:15:07 INFO BlockManager: Found block rdd_9_1 locally
16/05/19 16:15:07 INFO BlockManager: Found block rdd_9_5 locally
16/05/19 16:15:07 INFO BlockManager: Found block rdd_9_0 locally
16/05/19 16:15:10 INFO JobScheduler: Added jobs for time 1463667310000 ms
16/05/19 16:15:15 INFO JobScheduler: Added jobs for time 1463667315000 ms
16/05/19 16:15:20 INFO JobScheduler: Added jobs for time 1463667320000 ms
16/05/19 16:15:25 INFO JobScheduler: Added jobs for time 1463667325000 ms
16/05/19 16:15:30 INFO JobScheduler: Added jobs for time 1463667330000 ms
16/05/19 16:15:35 INFO JobScheduler: Added jobs for time 1463667335000 ms
16/05/19 16:15:40 INFO JobScheduler: Added jobs for time 1463667340000 ms
16/05/19 16:15:45 INFO JobScheduler: Added jobs for time 1463667345000 ms
.... continues until the program is manually terminated
I'd be very glad to find a way to resolve this.
I've attached a screenshot of the Spark UI.
Answer 0 (score: 0)
I suspect the subsequent calls to saveAsCassandraTable fail because the table already exists. You should create the table outside the streaming loop, as sketched below.
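A minimal sketch of that idea, not the asker's actual code: it assumes the spark-cassandra-connector's CassandraConnector is available, that ssc is the StreamingContext, and guesses the column types; the keyspace, table, and column names are taken from the question.

import com.datastax.spark.connector.cql.CassandraConnector

// Create the table once, before the streaming context is started,
// so nothing inside foreachRDD has to (re)create it.
CassandraConnector(ssc.sparkContext.getConf).withSessionDo { session =>
  session.execute(
    """CREATE TABLE IF NOT EXISTS analytics.test_logs (
      |  timestamp timestamp PRIMARY KEY,
      |  c_ip text,
      |  c_referrer text,
      |  c_user_agent text
      |)""".stripMargin)
}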
I would check whether switching to saveToCassandra
resolves the problem. If it does not, the executor logs or a screenshot of the Streaming UI would help.
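For illustration, a minimal sketch of the foreachRDD body using saveToCassandra instead, assuming the table was created up front as above and that the elements of logitems are instances of a case class whose fields match the listed columns:

import com.datastax.spark.connector._

logitems.foreachRDD { items =>
  if (items.count() == 0)
    println("No log item received")
  else
    // Appends this batch's rows to the existing analytics.test_logs table
    items.saveToCassandra("analytics", "test_logs",
      SomeColumns("timestamp", "c_ip", "c_referrer", "c_user_agent"))
}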