rdd.saveAsCassandraTable() creates the table but does not write the RDD items to Cassandra

Date: 2016-05-19 15:26:22

Tags: scala apache-spark cassandra

I am building a log-processing application with Scala (2.11), Spark Streaming (1.5.0), and Cassandra (3.5). Currently, when the first batch of RDD items arrives and foreachRDD(...) runs:

  1. The first element of the batch prints without any problem.
  2. The saveAsCassandraTable() method correctly creates the required table schema in Cassandra, but does not insert any RDD entries into the table.

    logitems.foreachRDD(items => {
      if (items.count() == 0)
        println("No log item received")
      else {
        val first = items.first()
        println(first.timestamp)  // WORKS: prints the timestamp of the first RDD element
    
        items.saveAsCassandraTable("analytics", "test_logs", SomeColumns("timestamp", "c_ip", "c_referrer", "c_user_agent"))
        // the table schema is created, but the RDD items are not written
      }
    })
    
    
    
    16/05/19 16:15:06 INFO Cluster: New Cassandra host /192.168.1.95:9042 added
    16/05/19 16:15:06 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
    16/05/19 16:15:07 INFO SparkContext: Starting job: foreachRDD at StreamingApp.scala:27
    16/05/19 16:15:07 INFO DAGScheduler: Got job 8 (foreachRDD at StreamingApp.scala:27) with 8 output partitions
    16/05/19 16:15:07 INFO DAGScheduler: Final stage: ResultStage 6(foreachRDD at StreamingApp.scala:27)
    16/05/19 16:15:07 INFO DAGScheduler: Parents of final stage: List()
    16/05/19 16:15:07 INFO DAGScheduler: Missing parents: List()
    16/05/19 16:15:07 INFO DAGScheduler: Submitting ResultStage 6 (MapPartitionsRDD[9] at map at StreamingApp.scala:24), which has no missing parents
    16/05/19 16:15:07 INFO MemoryStore: ensureFreeSpace(13272) called with curMem=122031, maxMem=1538166620
    16/05/19 16:15:07 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 13.0 KB, free 1466.8 MB)
    16/05/19 16:15:07 INFO MemoryStore: ensureFreeSpace(5909) called with curMem=135303, maxMem=1538166620
    16/05/19 16:15:07 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 5.8 KB, free 1466.8 MB)
    16/05/19 16:15:07 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:63323 (size: 5.8 KB, free: 1466.8 MB)
    16/05/19 16:15:07 INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:861
    16/05/19 16:15:07 INFO DAGScheduler: Submitting 8 missing tasks from ResultStage 6 (MapPartitionsRDD[9] at map at StreamingApp.scala:24)
    16/05/19 16:15:07 INFO TaskSchedulerImpl: Adding task set 6.0 with 8 tasks
    16/05/19 16:15:07 INFO TaskSetManager: Starting task 0.0 in stage 6.0 (TID 34, localhost, PROCESS_LOCAL, 1943 bytes)
    16/05/19 16:15:07 INFO TaskSetManager: Starting task 1.0 in stage 6.0 (TID 35, localhost, PROCESS_LOCAL, 1943 bytes)
    16/05/19 16:15:07 INFO TaskSetManager: Starting task 2.0 in stage 6.0 (TID 36, localhost, PROCESS_LOCAL, 1943 bytes)
    16/05/19 16:15:07 INFO TaskSetManager: Starting task 3.0 in stage 6.0 (TID 37, localhost, PROCESS_LOCAL, 1943 bytes)
    16/05/19 16:15:07 INFO TaskSetManager: Starting task 4.0 in stage 6.0 (TID 38, localhost, PROCESS_LOCAL, 1943 bytes)
    16/05/19 16:15:07 INFO TaskSetManager: Starting task 5.0 in stage 6.0 (TID 39, localhost, PROCESS_LOCAL, 1943 bytes)
    16/05/19 16:15:07 INFO TaskSetManager: Starting task 6.0 in stage 6.0 (TID 40, localhost, PROCESS_LOCAL, 1943 bytes)
    16/05/19 16:15:07 INFO Executor: Running task 0.0 in stage 6.0 (TID 34)
    16/05/19 16:15:07 INFO Executor: Running task 1.0 in stage 6.0 (TID 35)
    16/05/19 16:15:07 INFO Executor: Running task 4.0 in stage 6.0 (TID 38)
    16/05/19 16:15:07 INFO Executor: Running task 5.0 in stage 6.0 (TID 39)
    16/05/19 16:15:07 INFO Executor: Running task 3.0 in stage 6.0 (TID 37)
    16/05/19 16:15:07 INFO Executor: Running task 2.0 in stage 6.0 (TID 36)
    16/05/19 16:15:07 INFO Executor: Running task 6.0 in stage 6.0 (TID 40)
    16/05/19 16:15:07 INFO BlockManager: Found block rdd_9_6 locally
    16/05/19 16:15:07 INFO BlockManager: Found block rdd_9_3 locally
    16/05/19 16:15:07 INFO BlockManager: Found block rdd_9_4 locally
    16/05/19 16:15:07 INFO BlockManager: Found block rdd_9_2 locally
    16/05/19 16:15:07 INFO BlockManager: Found block rdd_9_1 locally
    16/05/19 16:15:07 INFO BlockManager: Found block rdd_9_5 locally
    16/05/19 16:15:07 INFO BlockManager: Found block rdd_9_0 locally
    16/05/19 16:15:10 INFO JobScheduler: Added jobs for time 1463667310000 ms
    16/05/19 16:15:15 INFO JobScheduler: Added jobs for time 1463667315000 ms
    16/05/19 16:15:20 INFO JobScheduler: Added jobs for time 1463667320000 ms
    16/05/19 16:15:25 INFO JobScheduler: Added jobs for time 1463667325000 ms
    16/05/19 16:15:30 INFO JobScheduler: Added jobs for time 1463667330000 ms
    16/05/19 16:15:35 INFO JobScheduler: Added jobs for time 1463667335000 ms
    16/05/19 16:15:40 INFO JobScheduler: Added jobs for time 1463667340000 ms
    16/05/19 16:15:45 INFO JobScheduler: Added jobs for time 1463667345000 ms
    
    ... continues until the program is manually terminated
    
  3. I would appreciate any help in solving this problem.

    spark ui

    I have attached a screenshot of the Spark UI.

1 Answer:

Answer 0 (score: 0)

I suspect the subsequent calls to saveAsCassandraTable fail because the table already exists. You should create the table outside of the streaming loop.

I would check whether switching to saveToCassandra solves the problem. If not, it may help to grab the executor logs or a screenshot of the Streaming UI.
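A minimal sketch of that suggestion, based on the spark-cassandra-connector API: create the table once up front (for example with CQL), and call saveToCassandra inside foreachRDD so each batch only appends rows. The `LogItem` case class here is a hypothetical stand-in for the questioner's record type; the keyspace, table, and column names are taken from the question, and `logitems` is assumed to be the DStream from the original code.

```scala
import com.datastax.spark.connector._

// Hypothetical record type matching the columns used in the question.
case class LogItem(timestamp: String, c_ip: String,
                   c_referrer: String, c_user_agent: String)

// Assumes analytics.test_logs was already created once, outside the
// streaming loop, e.g. with CQL:
//   CREATE TABLE analytics.test_logs (
//     timestamp text PRIMARY KEY,
//     c_ip text, c_referrer text, c_user_agent text);

logitems.foreachRDD { items =>
  if (items.isEmpty())
    println("No log item received")
  else
    // saveToCassandra writes rows into an existing table, instead of
    // trying to (re)create the table on every batch the way
    // saveAsCassandraTable does.
    items.saveToCassandra("analytics", "test_logs",
      SomeColumns("timestamp", "c_ip", "c_referrer", "c_user_agent"))
}
```

Note that `saveAsCassandraTable` both creates the table and writes the RDD, so calling it once per batch conflicts with the table created by the first batch; `saveToCassandra` separates table creation from writing.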