I have set up Apache Spark on a server, and it is now all up and running, eagerly waiting for data.
Here is my Java code:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

// Connect to the remote Spark master
SparkConf conf = new SparkConf().setAppName("myFirstJob").setMaster("spark://10.0.100.120:7077");
JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
javaSparkContext.setLogLevel("WARN");
SQLContext sqlContext = new SQLContext(javaSparkContext);
System.out.println("Hello, Remote Spark v." + javaSparkContext.version());

// Read the JSON file ("MM" means months in a Java date pattern; "mm" would be minutes)
DataFrame df = sqlContext.read().option("dateFormat", "yyyy-MM-dd")
    .json("./src/main/resources/north-carolina-school-performance-data.json"); // this is line #31

// Count the schools per district, ordered by district name
df = df.withColumn("district", df.col("fields.district"));
df = df.groupBy("district").count().orderBy(df.col("district"));
df.show(150);
Spark complains that the file ./src/main/resources/north-carolina-school-performance-data.json
does not exist on the server:
16/07/12 15:08:31 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, micha): java.io.FileNotFoundException: File file:/Users/jgp/git/net.jgp.labs.spark/src/main/resources/north-carolina-school-performance-data.json does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
...
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:244)
at net.jgp.labs.spark.FirstJob.main(FirstJob.java:31)
Caused by: java.io.FileNotFoundException: File file:/Users/jgp/git/net.jgp.labs.spark/src/main/resources/north-carolina-school-performance-data.json does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
Fair enough: the file is not on the server. I would like the read to happen locally, where the driver runs, and the data to be sent from there to the workers. Is there a way to do this, or is it outside the scope of Apache Spark? And if it is outside, any suggestion on the right way to do it? (I mean, I could set up a CIFS server and the like, but I find that a bit ugly.)
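Here is the kind of workaround I have in mind, as an untested sketch: read the file myself on the driver, where it actually exists, and hand the lines to the cluster with parallelize, so the workers never touch my local filesystem. It reuses javaSparkContext and sqlContext from the code above, and it assumes the file is line-delimited JSON, which it must already be for the read() above to work:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;

// Read the file on the driver, where it exists (throws IOException)
List<String> lines = Files.readAllLines(
    Paths.get("./src/main/resources/north-carolina-school-performance-data.json"));

// Ship the contents into the cluster ourselves; the workers only
// ever see the RDD, never the driver's local filesystem
JavaRDD<String> jsonLines = javaSparkContext.parallelize(lines);
DataFrame df = sqlContext.read().option("dateFormat", "yyyy-MM-dd").json(jsonLines);

This should be fine for a small file like this one, but it pulls everything through the driver's memory, which cannot be the right answer for anything big.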