Question

I set up a standalone Apache Spark cluster with 7 nodes. I want to run the following Scala code:

  /** Our main function where the action happens */
  def main(args: Array[String]) {
    // Set the log level to only print errors
    Logger.getLogger("org").setLevel(Level.ERROR)

    // Create a SparkContext without much actual configuration
    // We want EMR's config defaults to be used.
    val conf = new SparkConf()
    conf.setAppName("MovieSimilarities1M")
    val sc = new SparkContext(conf)

    val input = sc.textFile("file:///home/ralfahad/LearnSpark/SBTCreate/customer-orders.csv")
    val mappedInput = input.map(extractCustomerPricePairs)
    val totalByCustomer = mappedInput.reduceByKey( (x,y) => x + y )
    val flipped = totalByCustomer.map( x => (x._2, x._1) )
    val totalByCustomerSorted = flipped.sortByKey()
    val results = totalByCustomerSorted.collect()

    // Print the results.
    results.foreach(println)
  }
}

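The steps are:

I build the jar with sbt
I submit it with spark-submit *.jar

But my executors cannot find the file passed to sc.textFile("file:///home/ralfahad/LearnSpark/SBTCreate/customer-orders.csv")

The customer-orders.csv file is stored on my master PC.

Full stack trace: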

error: [Stage 0:> (0 + 2) / 2]17/09/25 17:32:35 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 5, 141.225.166.191, executor 2): java.io.FileNotFoundException: File file:/home/ralfahad/LearnSpark/SBTCreate/customer-orders.csv does not exist

How can I solve this problem?

Please modify the code so that it runs on my cluster.

Answer 1

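To make the file accessible to your worker nodes, you have several options.

1. Manually copy the file to all nodes.

Every node should have the file at exactly this path: /home/ralfahad/LearnSpark/SBTCreate/customer-orders.csv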
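
One way to confirm that the copy reached the workers is to run a small diagnostic job before the real one. The sketch below is only an illustration, not part of the original answer: it launches many tiny tasks, and each one reports its hostname and whether the file exists locally. The object name CheckFileOnWorkers and the task count of 100 are arbitrary choices. Spark does not guarantee that every worker receives a task, so a host missing from the output proves nothing either way.

```scala
import java.io.File
import java.net.InetAddress

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical diagnostic job; the object name is illustrative.
object CheckFileOnWorkers {

  // Plain local path (no file:// scheme), as java.io.File expects.
  val path = "/home/ralfahad/LearnSpark/SBTCreate/customer-orders.csv"

  // Evaluated on whichever machine runs the task.
  def probe(p: String): (String, Boolean) =
    (InetAddress.getLocalHost.getHostName, new File(p).exists)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CheckFileOnWorkers"))
    try {
      sc.parallelize(1 to 100, 100) // many small tasks, to reach as many executors as possible
        .map(_ => probe(path))
        .distinct()
        .collect()
        .sortBy(_._1)
        .foreach { case (host, exists) => println(s"$host: file present = $exists") }
    } finally {
      sc.stop()
    }
  }
}
```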

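2. Submit your job with the file attached.

spark-submit has an option called --files, which lets you ship any number of files along with the job when you submit it, like this:

spark-submit --master ... --jars ... --files /home/ralfahad/LearnSpark/SBTCreate/customer-orders.csv

Don't abuse this. The option is better suited to testing and to small files.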
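
Files shipped with --files are placed in each executor's working directory, so the original file:///home/... path still will not exist on the workers. A minimal sketch of reading the shipped copy follows, assuming the job was submitted with --files as shown above. SparkFiles.get resolves the local download location from the bare file name; the object name ReadShippedFile and the helper readLines are illustrative. The whole file is read inside a single task, which is another reason to keep this for small files.

```scala
import scala.io.Source

import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

// Illustrative object name; assumes the job was submitted with
//   --files /home/ralfahad/LearnSpark/SBTCreate/customer-orders.csv
object ReadShippedFile {

  // Reads every line of a local file and closes the handle afterwards.
  def readLines(localPath: String): List[String] = {
    val source = Source.fromFile(localPath)
    try source.getLines().toList finally source.close()
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ReadShippedFile"))
    try {
      // One element in one partition, so exactly one task reads the file.
      val lines = sc.parallelize(Seq(0), 1).flatMap { _ =>
        // Resolved on the executor: the local copy fetched for this job.
        readLines(SparkFiles.get("customer-orders.csv"))
      }
      println(s"Read ${lines.count()} lines")
    } finally {
      sc.stop()
    }
  }
}
```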

3. Use external shared storage that all nodes can access.

S3 and NFS shares are popular choices.

sc.textFile("s3n://bucketname/customer-orders.csv")
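
Once the file lives on shared storage, it helps not to hard-code its location in the job. The sketch below takes the input path from the first command-line argument, so the same jar works with an S3 URL or a shared mount. The fallback path assumes an NFS share mounted at /mnt/shared on every node; that mount point and the object name SharedStorageInput are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative object name.
object SharedStorageInput {

  // ASSUMPTION: an NFS share mounted at the same path on every node.
  val defaultInput = "file:///mnt/shared/customer-orders.csv"

  // The first program argument wins; otherwise fall back to the default.
  def inputPath(args: Array[String]): String =
    args.headOption.getOrElse(defaultInput)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SharedStorageInput"))
    try {
      val path = inputPath(args)
      val input = sc.textFile(path)
      println(s"Read ${input.count()} lines from $path")
    } finally {
      sc.stop()
    }
  }
}
```

Arguments placed after the jar on the spark-submit command line are passed to main, so the S3 URL from the example above can be supplied there without rebuilding.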

4. Read the data in the driver and then convert it into an RDD for processing.

val bufferedSource = scala.io.Source.fromFile("/home/ralfahad/LearnSpark/SBTCreate/customer-orders.csv")
val lines = try bufferedSource.getLines().toList finally bufferedSource.close()
val rdd = sc.makeRDD(lines)

This is generally not recommended, but it is fine for quick tests.
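
Combined with the job from the question, option 4 might look like the sketch below. The question does not show extractCustomerPricePairs, so the version here is a guess that assumes each line has the form customerId,itemId,amount; adjust it to the real file layout. The object name DriverSideRead is illustrative. This only works when the driver runs on the machine that holds the file, and the whole file has to fit in the driver's memory.

```scala
import scala.io.Source

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative object name.
object DriverSideRead {

  // ASSUMPTION: each line is "customerId,itemId,amount". The original
  // extractCustomerPricePairs is not shown in the question.
  def extractCustomerPricePairs(line: String): (Int, Float) = {
    val fields = line.split(",")
    (fields(0).trim.toInt, fields(2).trim.toFloat)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DriverSideRead"))
    try {
      // Read on the driver, where the file actually exists.
      val source = Source.fromFile("/home/ralfahad/LearnSpark/SBTCreate/customer-orders.csv")
      val lines = try source.getLines().toList finally source.close()

      // Ship the lines to the cluster, then run the same pipeline as the question.
      val results = sc.makeRDD(lines)
        .map(extractCustomerPricePairs)
        .reduceByKey(_ + _)
        .map { case (customer, total) => (total, customer) }
        .sortByKey()
        .collect()

      results.foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```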
