How do I get Spark to run on Windows 10 without Hadoop?

Asked: 2018-08-09 12:53:28

Tags: apache-spark hadoop pyspark

I am trying to get Spark running on Windows 10, but I keep hitting errors. I have researched this extensively and am still stuck. Here is what I have done:

  1. Installed JDK 1.8 (works fine)
  2. Installed Anaconda3 (works fine)
  3. Extracted Spark 2.3.1
  4. Downloaded winutils.exe from here and placed it in .\Hadoop\bin\ (apart from that file, the rest of the Hadoop folder is empty; I was told I do not need Hadoop itself)
  5. Set the environment variables as follows:
    1. User variable: PATH = .\Continuum\anaconda3
    2. System variables:
      • JAVA_HOME = .\Java\jdk1.8.0_161
      • HADOOP_HOME = .\Hadoop
      • PYSPARK_DRIVER_PYTHON = jupyter
      • PYSPARK_DRIVER_PYTHON_OPTS = notebook
      • PATH = .\Java\jdk1.8.0_161\bin;.\Hadoop\bin;.\spark\bin;.\Hadoop\bin\
  6. Created the folder C:\tmp\hive and ran the command winutils.exe chmod -R 777 \tmp\hive (a command-line sketch of steps 5 and 6 follows this list)
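
For reference, here is a rough sketch of steps 5 and 6 as Windows command-prompt commands, for the current session only (I actually set the variables persistently through the System Properties dialog). All paths are placeholders for wherever the JDK, the winutils folder, and Spark were extracted:

:: Placeholder install locations -- substitute the real ones
set "JAVA_HOME=C:\Java\jdk1.8.0_161"
set "HADOOP_HOME=C:\Hadoop"
set "PYSPARK_DRIVER_PYTHON=jupyter"
set "PYSPARK_DRIVER_PYTHON_OPTS=notebook"
set "PATH=%PATH%;%JAVA_HOME%\bin;%HADOOP_HOME%\bin;C:\spark\bin"

:: Create Hive's scratch directory and make it writable via winutils.exe
mkdir C:\tmp\hive
"%HADOOP_HOME%\bin\winutils.exe" chmod -R 777 \tmp\hive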

When I run pyspark from a terminal, Jupyter launches and runs code without problems. However, the following code keeps failing, and based on my research I am fairly sure it is related to my installation.

from datetime import datetime
from pyspark.sql.types import Row

# sc is the SparkContext created automatically by the pyspark shell
records = sc.parallelize([[1, "Alice", 50], [2, "Bob", 80]])

df = records.toDF()

which produces this error:

Py4JJavaError                             Traceback (most recent call last)
~\spark\python\pyspark\sql\utils.py in deco(*a, **kw)
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:

~\spark\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:

Py4JJavaError: An error occurred while calling o24.applySchemaToPythonRDD.
: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-;
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
    at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194)
    at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
    at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
    at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39)
    at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54)
    at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52)
    at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1.<init>(HiveSessionStateBuilder.scala:69)
    at org.apache.spark.sql.hive.HiveSessionStateBuilder.analyzer(HiveSessionStateBuilder.scala:69)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293)
    at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:79)
    at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:79)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
    at org.apache.spark.sql.SparkSession.internalCreateDataFrame(SparkSession.scala:577)
    at org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:752)
    at org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:737)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
    at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:180)
    at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:114)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)
    at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:385)
    at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:287)
    at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
    at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:195)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
    ... 30 more
Caused by: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-
    at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:612)
    at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
    ... 45 more


During handling of the above exception, another exception occurred:

AnalysisException                         Traceback (most recent call last)
<ipython-input-5-576476669b32> in <module>()
----> 1 df = records.toDF()

1 Answer:

Answer 0 (score: 1)

So here is what ended up working for me:

I deleted the Hadoop directory that held nothing but winutils.exe and instead downloaded the hadoop-2.7.1 build from https://github.com/steveloughran/winutils, pointed my HADOOP_HOME environment variable at that folder, added its bin subfolder to my PATH, and restarted.
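
For reference, a quick way to sanity-check the fix from the command prompt (a sketch only, assuming HADOOP_HOME now points at the extracted hadoop-2.7.1 folder and C:\tmp\hive is the scratch directory from the question):

:: Confirm the environment points at the new winutils build
echo %HADOOP_HOME%
where winutils.exe

:: Re-apply and verify the permissions on Hive's scratch directory;
:: 'ls' should now report drwxrwxrwx rather than rw-rw-rw-
"%HADOOP_HOME%\bin\winutils.exe" chmod -R 777 \tmp\hive
"%HADOOP_HOME%\bin\winutils.exe" ls \tmp\hive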