Why does pyspark fail with "Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'"?

Asked: 2017-10-25 04:27:44

Tags: apache-spark pyspark

For the life of me, I cannot figure out what is wrong with my PySpark install. I have installed all the dependencies, including Hadoop, but PySpark cannot find it. Have I diagnosed this correctly?

See the full error message below; ultimately it fails on PySpark SQL with:

pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"

nickeleres@Nicks-MBP:~$ pyspark
Python 2.7.10 (default, Feb  7 2017, 00:08:15) 
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/opt/spark-2.2.0/jars/hadoop-auth-2.7.3.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
17/10/24 21:21:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/10/24 21:21:59 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
17/10/24 21:21:59 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
17/10/24 21:21:59 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
Traceback (most recent call last):
  File "/opt/spark/python/pyspark/shell.py", line 45, in <module>
    spark = SparkSession.builder\
  File "/opt/spark/python/pyspark/sql/session.py", line 179, in getOrCreate
    session._jsparkSession.sessionState().conf().setConfString(key, value)
  File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/opt/spark/python/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"
>>> 

4 answers:

Answer 0 (score: 1):

tl;dr Shut down all other Spark processes and start over.

The following WARN messages indicate that there is another process (or several processes) already holding those ports.

I am fairly sure those processes are Spark ones, e.g. other pyspark sessions or Spark applications.

17/10/24 21:21:59 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
17/10/24 21:21:59 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
17/10/24 21:21:59 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.

That is why, after Spark/pyspark found that port 4044 was free for the web UI, it went on to instantiate HiveSessionStateBuilder and failed.

pyspark fails because you cannot have more than one Spark application up and running that uses the same local Hive metastore.
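The port-retry behaviour in those WARN messages can be reproduced with plain sockets. This is a minimal sketch, not Spark's actual code; the retry count here loosely mirrors Spark's `spark.port.maxRetries` setting (default 16):

```python
import socket

def find_free_port(start, max_retries=16):
    """Try start, start+1, ... the way Spark probes for a free SparkUI port."""
    for port in range(start, start + max_retries):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("127.0.0.1", port))
            return port  # bound successfully: this port is free
        except OSError:
            print(f"Service could not bind on port {port}. Attempting port {port + 1}.")
        finally:
            s.close()
    raise OSError(f"no free port in range {start}-{start + max_retries - 1}")

print(find_free_port(4040))
```

In the transcript above, three earlier Spark processes were already holding 4040-4042; `jps` (or `ps aux | grep spark`) will show them, and killing them frees both the ports and the Hive metastore lock.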

Answer 1 (score: 0):

Why does this happen?

Because we tried to create a new session more than once, in different tabs of the Jupyter notebook.

Solution:

Start a single session in one Jupyter notebook tab, and avoid creating new sessions in other tabs:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('EXAMPLE').getOrCreate()
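`getOrCreate()` is safe to call repeatedly because it returns the already-running session instead of building a second one. The pattern is essentially a singleton; here is a toy stand-in (not pyspark's implementation) showing the semantics:

```python
class SessionBuilder:
    """Toy stand-in for SparkSession.builder's getOrCreate semantics."""
    _active = None

    @classmethod
    def get_or_create(cls):
        # Reuse the active session if one exists; build a new one only otherwise.
        if cls._active is None:
            cls._active = cls()
        return cls._active

a = SessionBuilder.get_or_create()
b = SessionBuilder.get_or_create()  # second call: same object, no second session
print(a is b)  # prints True
```

The error in the question arises when a *separate process* (a second notebook kernel, another pyspark shell) builds its own session, since `getOrCreate()` can only reuse a session within the same process.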

Answer 2 (score: 0):

Another possible cause is that the Spark application fails to start because the minimum machine requirements are not met.

In the Application History tab:

Diagnostics:Uncaught exception: org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores < 0, or requested virtual cores > max configured, requestedVirtualCores=5, maxVirtualCores=4

Illustration: [screenshot: Spark app fails to start]
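The diagnostics say requestedVirtualCores=5 while maxVirtualCores=4, so YARN rejects the container request before Spark can start. The check amounts to the following (a toy sketch, not YARN's actual code):

```python
def validate_cores(requested, max_configured):
    """Mimic YARN's InvalidResourceRequestException check:
    requested virtual cores must be > 0 and <= the configured maximum."""
    if requested <= 0 or requested > max_configured:
        raise ValueError(
            f"Invalid resource request, requested virtual cores < 0, "
            f"or requested virtual cores > max configured, "
            f"requestedVirtualCores={requested}, maxVirtualCores={max_configured}"
        )
    return requested

# The failing request from the diagnostics above: 5 cores asked, 4 available.
try:
    validate_cores(5, 4)
except ValueError as e:
    print(e)
```

The fix is to request no more cores than YARN allows (e.g. lower `spark.executor.cores` to 4 or less), or raise the cluster's `yarn.scheduler.maximum-allocation-vcores`.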

Answer 3 (score: 0):

We got the same error while trying to create a Spark session from a Jupyter Notebook. In our case, we noticed that the user did not have permission on the Spark scratch directory, i.e. the directory pointed to by the Spark property "spark.local.dir". We changed the permissions on that directory so the user had full access, and the problem was resolved. Typically this directory lives somewhere like "/tmp/user".

Note that, according to the Spark documentation, the scratch directory is the "directory to use for 'scratch' space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks."
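Before changing permissions, it can help to check that every entry in `spark.local.dir` (which may be a comma-separated list) exists and is writable by the current user. A small helper sketch, not part of Spark:

```python
import os
import tempfile

def check_scratch_dirs(spark_local_dir):
    """Report whether each spark.local.dir entry exists and is writable."""
    status = {}
    for raw in spark_local_dir.split(","):
        d = raw.strip()
        if not os.path.isdir(d):
            status[d] = "missing"
        elif not os.access(d, os.W_OK):
            status[d] = "not writable"
        else:
            status[d] = "ok"
    return status

# The system temp dir is the usual default scratch location.
print(check_scratch_dirs(tempfile.gettempdir()))
```

Any entry reported as "missing" or "not writable" is a candidate for the permission fix described above (e.g. `chmod`/`chown` on the directory).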