I am trying to run Spark ML algorithms in an environment that does not include Hadoop.
I have not been able to find an answer in the tutorials and other posts as to whether this is even possible:
Can I run Spark without any version of Hadoop and without any HDFS, or do I have to install Hadoop for Spark?
When I run the Spark shell, I get the following message:
C:\spark-2.2.0-bin-without-hadoop\bin>spark-shell
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:124)
at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:124)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.deploy.SparkSubmitArguments.mergeDefaultSparkProperties(SparkSubmitArguments.scala:124)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:110)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more
Here is my sample program:
package com.example.spark_example;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class Main {

    public static void main(String[] args) {
        String logFile = "C:\\spark-2.2.0-bin-without-hadoop\\README.md"; // Should be some file on your system
        SparkConf conf = new SparkConf().setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.textFile(logFile).cache();
        long numAs = logData.filter((Function<String, Boolean>) s -> s.contains("a")).count();
        long numBs = logData.filter((Function<String, Boolean>) s -> s.contains("b")).count();
        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
        sc.stop();
    }
}
which produces the following exception:
17/08/10 15:23:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/10 15:23:35 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
Answer 0 (score: 2)
"Can I run Spark without any version of Hadoop?"

You can't. While Spark does not require a Hadoop cluster (YARN, HDFS), it does depend on the Hadoop libraries. If you don't have a Hadoop installation that provides them, use the full release described as pre-built for Apache Hadoop. In your case:
spark-2.2.0-bin-hadoop2.7
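If you specifically want to keep the without-hadoop build, the Spark documentation on the "Hadoop free" build describes supplying the Hadoop classes yourself through the SPARK_DIST_CLASSPATH environment variable. A minimal sketch for Windows, assuming a separate Hadoop installation at C:\hadoop (a hypothetical path without spaces; substitute your own), saved as conf\spark-env.cmd:

rem conf\spark-env.cmd: loaded by the Spark launch scripts on Windows.
rem Point the Hadoop-free Spark build at an existing Hadoop installation
rem so classes such as org.apache.hadoop.fs.FSDataInputStream can be found.
rem C:\hadoop is an assumed location; adjust it to your machine.
set HADOOP_HOME=C:\hadoop
for /f "delims=" %%i in ('%HADOOP_HOME%\bin\hadoop classpath') do set SPARK_DIST_CLASSPATH=%%i

The simpler route, though, is the one above: download spark-2.2.0-bin-hadoop2.7, which already bundles these libraries.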
Answer 1 (score: 1)
If you downloaded Apache Spark with a prebuilt package type, you already have all the libraries you need. To solve your problem you need to install winutils, the Windows binaries for Hadoop.
Just copy all of the files from the winutils folder into your folder
%SPARK_HOME%\bin
and add an environment variable %HADOOP_HOME% with the value %SPARK_HOME%. (Hadoop then looks for winutils.exe at %HADOOP_HOME%\bin\winutils.exe, which now resolves to the copy you just placed in %SPARK_HOME%\bin; this is exactly the lookup that failed with "null\bin\winutils.exe" above.)
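If you would rather not set a machine-wide environment variable, the same fix can be applied from inside the program. A minimal sketch, assuming winutils.exe was placed under C:\hadoop\bin (a hypothetical path; substitute your own); the class name WinutilsExample is likewise illustrative:

package com.example.spark_example;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class WinutilsExample {

    public static void main(String[] args) {
        // hadoop.home.dir is the system property Hadoop's Shell class falls back to
        // when HADOOP_HOME is not set; it must point at the directory containing
        // bin\winutils.exe and must be set before the first Hadoop class is used.
        // C:\hadoop is an assumed location; adjust it to your machine.
        System.setProperty("hadoop.home.dir", "C:\\hadoop");

        // local[*] runs Spark in-process, so no cluster (and no HDFS) is required.
        SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        System.out.println("Spark started without the winutils error.");
        sc.stop();
    }
}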