Spark hangs while reading an RDD

Time: 2016-04-05 14:37:40

Tags: scala apache-spark

I have an Apache Spark master node. When I try to iterate over an RDD, Spark hangs.

Here is a sample of my code:

val conf = new SparkConf()
      .setAppName("Demo")
      .setMaster("spark://localhost:7077")
      .set("spark.executor.memory", "1g")

val sc = new SparkContext(conf)

val records = sc.textFile("file:///Users/barbara/projects/spark/src/main/resources/videos.csv")    
println("Start")   

records.collect().foreach(println)    

println("Finish")

The Spark log says:

Start
16/04/05 17:32:23 INFO FileInputFormat: Total input paths to process : 1
16/04/05 17:32:23 INFO SparkContext: Starting job: collect at Application.scala:23
16/04/05 17:32:23 INFO DAGScheduler: Got job 0 (collect at Application.scala:23) with 2 output partitions
16/04/05 17:32:23 INFO DAGScheduler: Final stage: ResultStage 0 (collect at Application.scala:23)
16/04/05 17:32:23 INFO DAGScheduler: Parents of final stage: List()
16/04/05 17:32:23 INFO DAGScheduler: Missing parents: List()
16/04/05 17:32:23 INFO DAGScheduler: Submitting ResultStage 0 (file:///Users/barbara/projects/spark/src/main/resources/videos.csv MapPartitionsRDD[1] at textFile at Application.scala:19), which has no missing parents
16/04/05 17:32:23 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.0 KB, free 120.5 KB)
16/04/05 17:32:23 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1811.0 B, free 122.3 KB)
16/04/05 17:32:23 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.18.199.187:55983 (size: 1811.0 B, free: 2.4 GB)
16/04/05 17:32:23 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/04/05 17:32:23 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (file:///Users/barbara/projects/spark/src/main/resources/videos.csv MapPartitionsRDD[1] at textFile at Application.scala:19)
16/04/05 17:32:23 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks

I only see the "Start" message. It seems Spark is unable to read the RDD. Any ideas how to fix this?

UPD

The data I want to read:

123v4n312bv4nb12,Action,Comedy
2n4vhj2gvrh24gvr,Action,Drama
sjfu326gjrw6g374,Drama,Horror

2 Answers:

Answer 0 (score: 2)

If Spark hangs on such a small dataset, I would first look for:

  • Am I trying to connect to a cluster that doesn't respond or doesn't exist? If I am trying to connect to a running cluster, I would first try to run the same code locally with setMaster("local[*]"). If that works, I know that something is wrong with the "master" I am trying to connect to (see the sketch after this list).

  • Am I asking for more resources than the cluster has to offer? For example, if the cluster manages 2 GB and I request a 3 GB executor, my application will never get scheduled and will sit in the job queue forever.
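A minimal sketch of the local sanity check from the first bullet (only the file path is taken from the question; the app name and everything else here is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val localConf = new SparkConf()
  .setAppName("Demo-local")
  .setMaster("local[*]")   // run inside the driver JVM, no standalone cluster needed

val localSc = new SparkContext(localConf)
val records = localSc.textFile("file:///Users/barbara/projects/spark/src/main/resources/videos.csv")
println(records.count())   // 3 for the sample data shown in the UPD
localSc.stop()

If this prints the count, the code and the input file are fine and the problem lies with the cluster you are connecting to.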

Specific to the comments above: if you started your cluster with sbin/start-master.sh, you will not get a running cluster. At the very least you need a master and a worker (for standalone mode). You should use the start-all.sh script. I recommend doing a bit more homework and following a tutorial.

Answer 1 (score: 0)

Use this instead:

val bufferedSource = io.Source.fromFile("/path/filename.csv")

for (line <- bufferedSource.getLines) {
  println(line)          // print each line read from the file on the driver
}

bufferedSource.close()   // release the file handle when done
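Note that this reads the file on the driver with plain Scala I/O rather than through Spark, which is fine for a small local CSV but does not address the cluster problem. A slightly extended sketch (the path is the same placeholder as above; the field handling is illustrative, based on the sample rows in the UPD) that also splits each row into its columns:

import scala.io.Source

val source = Source.fromFile("/path/filename.csv")
val rows = source.getLines()
  .map(_.split(",").map(_.trim))   // e.g. Array("123v4n312bv4nb12", "Action", "Comedy")
  .toList
source.close()

rows.foreach(r => println(s"${r(0)} -> ${r.drop(1).mkString(", ")}"))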