I am trying to run some examples in Python using a self-contained Spark application (through PyCharm).
I installed pyspark with:
pip install pyspark
According to the website with the examples, it should be enough to run:
python nameofthefile.py
But I get this error:
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2422)
at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
at org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:359)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$secMgr$1(SparkSubmit.scala:359)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:367)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:367)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:366)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3319)
at java.base/java.lang.String.substring(String.java:1874)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:52)
... 23 more
Traceback (most recent call last):
File "C:/Users/.../PycharmProjects/PoC/Databricks.py", line 4, in <module>
spark = SparkSession.builder.appName("Databricks").getOrCreate()
File "C:\Users\...\Desktop\env\lib\site-packages\pyspark\sql\session.py", line 173, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "C:\Users\...\Desktop\env\lib\site-packages\pyspark\context.py", line 349, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "C:\Users\...\Desktop\env\lib\site-packages\pyspark\context.py", line 115, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "C:\Users\...\Desktop\env\lib\site-packages\pyspark\context.py", line 298, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "C:\Users\...\Desktop\env\lib\site-packages\pyspark\java_gateway.py", line 94, in launch_gateway
raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
What is wrong?
Update
Following the post where the solution can be found, in my case I had to switch from jdk-11 to jdk1.8.
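The "Caused by" line in the first trace hints at why JDK 11 fails here: Shell.<clinit> calls substring(0, 3) on a string of length 2, which is consistent with the java.version property being just "11" on JDK 11 instead of something like "1.8.0_121". A minimal sketch of that difference follows. The helper names java_major and first_three_chars are hypothetical, written only for illustration, and are not part of PySpark or Hadoop:

```python
def java_major(version: str) -> int:
    """Return the major Java version from a java.version-style string.

    Hypothetical helper; handles the legacy "1.8.0_121" scheme and the
    newer "11" / "11.0.2" scheme.
    """
    parts = version.split(".")
    # Legacy scheme: "1.8.0_121" means Java 8.
    if parts[0] == "1" and len(parts) > 1:
        return int(parts[1])
    # Newer scheme: the first component is the major version ("11-ea" -> 11).
    return int(parts[0].split("-")[0])


def first_three_chars(version: str) -> str:
    """Mimic Java's String.substring(0, 3), which fails on shorter strings."""
    if len(version) < 3:
        raise IndexError(f"begin 0, end 3, length {len(version)}")
    return version[:3]
```

With these, first_three_chars("1.8.0_121") succeeds while first_three_chars("11") raises with the same "begin 0, end 3, length 2" message seen in the trace, which is why switching to JDK 8 avoids the crash with this Spark/Hadoop combination.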
Now I can run the example code, but there is still an error (it does not prevent the code from running):
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:387)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2422)
at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
at org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:359)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$secMgr$1(SparkSubmit.scala:359)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:367)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:367)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:366)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2019-01-24 08:46:16 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
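There is a known solution for this Could not locate executable null\bin\winutils.exe error.
To solve this second problem, define the HADOOP_HOME and PATH environment variables in the Control Panel so that any Windows program can use them.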
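If changing the Control Panel settings is not convenient, the same two variables can be set for a single Python process before Spark starts. This is only a sketch: the helper with_hadoop_home is hypothetical, and "C:/hadoop" is an assumed folder that would need to contain bin\winutils.exe on your machine.

```python
import os


def with_hadoop_home(env: dict, hadoop_home: str) -> dict:
    """Return a copy of env with HADOOP_HOME set and its bin folder first on PATH."""
    updated = dict(env)
    updated["HADOOP_HOME"] = hadoop_home
    # winutils.exe is looked up under <HADOOP_HOME>\bin, as the error message shows.
    bin_dir = hadoop_home.rstrip("/\\") + "/bin"
    existing = updated.get("PATH", "")
    updated["PATH"] = bin_dir + (os.pathsep + existing if existing else "")
    return updated


# Example with a hypothetical location; run this before creating the SparkSession:
# os.environ.update(with_hadoop_home(os.environ, "C:/hadoop"))
```

Variables set this way are only visible to the current Python process and the JVM it launches, unlike the Control Panel approach, which applies to every Windows program.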
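Answer 0 (score: 1)
Short answer:
I had a similar problem, and it was solved by changing the JAVA_HOME environment variable. You can manually add a new user environment variable JAVA_HOME that points to the path of the Java Development Kit, such as "C:/Progra~1/Java/jdk1.8.0_121" (or "C:/Progra~2/Java/jdk1.8.0_121" if it is installed under "Program Files (x86)" on Windows).
You can also try the following at the beginning of your Python code: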
import os
os.environ["JAVA_HOME"] = "C:/Progra~1/Java/jdk1.8.0_121"
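(or "C:/Progra~2/Java/jdk1.8.0_121" if your JDK is installed under "Program Files (x86)")
Longer answer: Did you install the Spark binaries (including Hadoop) independently of PySpark? You also need a compatible Java Development Kit (JDK) installed (Java 8+ according to the Spark 2.3.0 requirements; note that the question was resolved by moving from JDK 11 to JDK 8). You also need to configure user environment variables, such as:
JAVA_HOME: the path to the Java Development Kit
SPARK_HOME: the path to the Spark binaries
HADOOP_HOME: the path to the Hadoop binaries
You can do this from Python as follows: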
import os
os.environ["JAVA_HOME"] = "C:/Progra~2/Java/jdk1.8.0_121"
os.environ["SPARK_HOME"] = "/path/to/spark-2.3.1-bin-hadoop2.7"
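Before starting Spark, it can help to confirm that the three variables named above are actually visible to the Python process. A minimal sketch, where missing_spark_env is a hypothetical helper and not part of PySpark or findspark:

```python
import os

# The variable names come from the answer above.
REQUIRED_VARS = ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME")


def missing_spark_env(env=None):
    """Return the names in REQUIRED_VARS that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_spark_env()
    if missing:
        print("Not set:", ", ".join(missing))
```

An empty list means all three are defined; it does not check that the paths exist or point to compatible versions.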
Then I recommend using findspark (which can be installed with pip install findspark): https://github.com/minrk/findspark
You can then use it like this:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
On Windows in particular, JAVA_HOME should look like:
C:\Progra~1\Java\jdk1.8.0_121
and, "if the JDK is installed under \Program Files (x86), replace the Progra~1 part with Progra~2."
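The Progra~1 and Progra~2 forms are the short (8.3) names of the two Program Files folders, which avoid spaces in JAVA_HOME. The rewrite described above can be sketched as a plain string substitution. This assumes the default short names mentioned in this answer; on systems where short names are disabled or assigned differently, the result will not be a valid path. The function name to_short_program_files is hypothetical:

```python
def to_short_program_files(path: str) -> str:
    """Replace the Program Files folder names with their usual 8.3 short names."""
    # Replace the longer name first so "(x86)" is not left behind.
    return (
        path.replace("Program Files (x86)", "Progra~2")
            .replace("Program Files", "Progra~1")
    )
```

For example, passing r"C:\Program Files\Java\jdk1.8.0_121" gives the JAVA_HOME value shown above.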
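Details on installing on Windows can be found here (written for Jupyter, but the Spark and PySpark installation is the same): https://changhsinlee.com/install-pyspark-windows-jupyter/
I hope this helps. Good luck, and have a nice day/evening!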