I have an Apache Spark master node. When I try to iterate over an RDD, Spark hangs.
Here is a sample of my code:
val conf = new SparkConf()
  .setAppName("Demo")
  .setMaster("spark://localhost:7077")
  .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)

val records = sc.textFile("file:///Users/barbara/projects/spark/src/main/resources/videos.csv")

println("Start")
records.collect().foreach(println)
println("Finish")
The Spark log says:
Start
16/04/05 17:32:23 INFO FileInputFormat: Total input paths to process : 1
16/04/05 17:32:23 INFO SparkContext: Starting job: collect at Application.scala:23
16/04/05 17:32:23 INFO DAGScheduler: Got job 0 (collect at Application.scala:23) with 2 output partitions
16/04/05 17:32:23 INFO DAGScheduler: Final stage: ResultStage 0 (collect at Application.scala:23)
16/04/05 17:32:23 INFO DAGScheduler: Parents of final stage: List()
16/04/05 17:32:23 INFO DAGScheduler: Missing parents: List()
16/04/05 17:32:23 INFO DAGScheduler: Submitting ResultStage 0 (file:///Users/barbara/projects/spark/src/main/resources/videos.csv MapPartitionsRDD[1] at textFile at Application.scala:19), which has no missing parents
16/04/05 17:32:23 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.0 KB, free 120.5 KB)
16/04/05 17:32:23 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1811.0 B, free 122.3 KB)
16/04/05 17:32:23 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.18.199.187:55983 (size: 1811.0 B, free: 2.4 GB)
16/04/05 17:32:23 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/04/05 17:32:23 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (file:///Users/barbara/projects/spark/src/main/resources/videos.csv MapPartitionsRDD[1] at textFile at Application.scala:19)
16/04/05 17:32:23 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
I see only the "Start" message. It seems that Spark cannot read the RDD. Any ideas how to fix this?
UPD
The data I want to read:
123v4n312bv4nb12,Action,Comedy
2n4vhj2gvrh24gvr,Action,Drama
sjfu326gjrw6g374,Drama,Horror
Answer 0 (score: 2)
If Spark hangs on such a small dataset, the first things I would look for are:

- Am I trying to connect to a cluster that doesn't respond or doesn't exist? If I am trying to connect to a running cluster, I would first try running the same code locally with setMaster("local[*]") (see the sketch after this list). If that works, I know that something is going on with the "master" I am trying to connect to.
- Am I asking for more resources than the cluster has to offer? For example, if the cluster manages 2 GB and I ask for a 3 GB executor, my application will never get scheduled; it will sit in the job queue forever.
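
As a rough sketch of that first check (the wrapper object name is made up; the file path is the one from the question), the same job can be run with the master swapped for local[*]. If this prints the records, the code is fine and the problem lies with the cluster being connected to:

import org.apache.spark.{SparkConf, SparkContext}

object LocalDebugApp {
  def main(args: Array[String]): Unit = {
    // Same job, but everything runs inside this JVM with one task slot
    // per core; no standalone master or worker is needed.
    val conf = new SparkConf()
      .setAppName("Demo")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)

    val records = sc.textFile("file:///Users/barbara/projects/spark/src/main/resources/videos.csv")
    records.collect().foreach(println)

    sc.stop()
  }
}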
Specific to the comments above: if you started your cluster with sbin/start-master.sh, you will not get a running cluster. You need at least a master and one worker (in standalone mode). You should use the start-all.sh script instead. I recommend doing a bit more homework and following a tutorial.
Answer 1 (score: 0)
Use this instead:
import scala.io.Source

val bufferedSource = Source.fromFile("/path/filename.csv")
for (line <- bufferedSource.getLines) {
  println(line)
}
bufferedSource.close() // release the file handle when done
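
Note that this reads the file with plain scala.io in the driver process and bypasses Spark entirely, so it only confirms that the file itself is readable; it does not explain why the Spark job hangs against the cluster.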