Transferring data to Apache Spark

Date: 2016-07-12 19:24:45

Tags: java apache-spark

I have set up Apache Spark on a server; it is all running now and eagerly waiting for data.

Here is my Java code:

    SparkConf conf = new SparkConf().setAppName("myFirstJob").setMaster("spark://10.0.100.120:7077");
    JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
    javaSparkContext.setLogLevel("WARN");
    SQLContext sqlContext = new SQLContext(javaSparkContext);

    System.out.println("Hello, Remote Spark v." + javaSparkContext.version());

    DataFrame df;
    df = sqlContext.read().option("dateFormat", "yyyy-MM-dd") // MM = month; lowercase mm would mean minutes
            .json("./src/main/resources/north-carolina-school-performance-data.json"); // this is line #31
    df = df.withColumn("district", df.col("fields.district"));
    df = df.groupBy("district").count().orderBy(df.col("district"));
    df.show(150);

Spark complains that the ./src/main/resources/north-carolina-school-performance-data.json file does not exist on the server:

16/07/12 15:08:31 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, micha): java.io.FileNotFoundException: File file:/Users/jgp/git/net.jgp.labs.spark/src/main/resources/north-carolina-school-performance-data.json does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
...
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
    at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:244)
    at net.jgp.labs.spark.FirstJob.main(FirstJob.java:31)
Caused by: java.io.FileNotFoundException: File file:/Users/jgp/git/net.jgp.labs.spark/src/main/resources/north-carolina-school-performance-data.json does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)

Fair enough, the file is not on the server. I was hoping the read would pick up the file locally, where the driver runs, and send it over. Is there a way to do that, or is it outside the scope of Apache Spark? If it is outside, any recommendation on the proper way to do it? (I mean, I could set up a CIFS server and so on, but I find that a bit ugly.)
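One way to do this without any shared filesystem is to read the file on the driver and ship its contents to the executors yourself: `JavaSparkContext.parallelize` distributes a local collection across the cluster, and the Spark 1.x `DataFrameReader.json(JavaRDD<String>)` overload can build a DataFrame from an RDD of JSON lines. A minimal sketch (assuming Java 8, the class name `FirstJobLocalRead`, and that the JSON file has one document per line, as Spark's JSON reader expects):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class FirstJobLocalRead {
    public static void main(String[] args) throws IOException {
        SparkConf conf = new SparkConf().setAppName("myFirstJob")
                .setMaster("spark://10.0.100.120:7077");
        JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(javaSparkContext);

        // Read the JSON file on the driver's local filesystem.
        List<String> lines = Files.readAllLines(
                Paths.get("./src/main/resources/north-carolina-school-performance-data.json"));

        // parallelize() ships the lines from the driver to the executors,
        // so the workers never need access to the local file.
        JavaRDD<String> jsonRdd = javaSparkContext.parallelize(lines);

        // Build the DataFrame from the RDD instead of from a path.
        DataFrame df = sqlContext.read().json(jsonRdd);
        df.show(150);
    }
}
```

This works for files small enough to hold in the driver's memory; for anything large, the usual approach is to put the data somewhere all nodes can reach (HDFS, S3, or the same absolute path on every worker) and pass that location to `read().json(...)`.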

0 Answers:
