Question

I am trying to run Hail (https://hail.is/) on a Spark cluster. When I try to create a HailContext, I get an error saying that I must set two configuration parameters: spark.sql.files.openCostInBytes and spark.sql.files.maxPartitionBytes.

$ pyspark --jars s3://<bucket_name>/hail-all-spark.jar --conf spark.driver.memory=4g --conf spark.executor.memory=4g 
Python 2.7.13 (default, Jan 31 2018, 00:17:36) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Downloading s3://<bucket_name>/hail-all-spark.jar to /tmp/tmp2718520966373391304/hail/hail-all-spark.jar.
18/03/01 10:19:27 INFO S3NativeFileSystem: Opening 's3://<bucket_name>/hail-all-spark.jar' for reading
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/03/01 10:19:32 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
18/03/01 10:20:06 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.1
      /_/

Using Python version 2.7.13 (default, Jan 31 2018 00:17:36)
SparkSession available as 'spark'.
>>> sc.addPyFile('s3://<bucket_name>/hail-python.zip')
>>> from hail import *
>>> hc = HailContext(sc)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<decorator-gen-478>", line 2, in __init__
  File "/mnt/tmp/spark-eebd27bf-b387-4717-9ae5-e94f81438aee/userFiles-fb511f51-35b3-436a-aa5d-d0d84de40851/hail-python.zip/hail/typecheck/check.py", line 245, in _typecheck
  File "/mnt/tmp/spark-eebd27bf-b387-4717-9ae5-e94f81438aee/userFiles-fb511f51-35b3-436a-aa5d-d0d84de40851/hail-python.zip/hail/context.py", line 88, in __init__
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:is.hail.HailContext.apply.
: is.hail.utils.HailException: Found problems with SparkContext configuration:
  Invalid config parameter 'spark.sql.files.openCostInBytes': too small. Found 0, require at least 50G
  Invalid config parameter 'spark.sql.files.maxPartitionBytes': too small. Found 0, require at least 50G
    at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:6)
    at is.hail.utils.package$.fatal(package.scala:27)
    at is.hail.HailContext$.checkSparkConfiguration(HailContext.scala:116)
    at is.hail.HailContext$.apply(HailContext.scala:169)
    at is.hail.HailContext.apply(HailContext.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)

How should I set these parameters correctly? Passing --conf spark.sql.files.openCostInBytes=60g raises an IllegalArgumentException:

$ pyspark --jars s3://<bucket_name>/hail-all-spark.jar --conf spark.driver.memory=4g --conf spark.executor.memory=4g --conf spark.sql.files.openCostInBytes=60g
Python 2.7.13 (default, Jan 31 2018, 00:17:36) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Downloading s3://<bucket_name>/hail-all-spark.jar to /tmp/tmp4400881534115197439/hail/hail-all-spark.jar.
18/03/01 10:26:32 INFO S3NativeFileSystem: Opening 's3://<bucket_name>/hail-all-spark.jar' for reading
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/03/01 10:26:38 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/shell.py", line 45, in <module>
    spark = SparkSession.builder\
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 183, in getOrCreate
    session._jsparkSession.sessionState().conf().setConfString(key, value)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"

Answer 1

The solution is to set spark.sql.files.openCostInBytes and spark.sql.files.maxPartitionBytes to 60000000000 (a plain byte count) instead of '60g':

$ pyspark --conf spark.sql.files.openCostInBytes=60000000000 --conf spark.sql.files.maxPartitionBytes=60000000000
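If you would rather not type the long number by hand, here is a minimal sketch of a helper that expands a size string into a plain integer byte count and builds the two --conf arguments. The names to_bytes and conf_args are hypothetical (they are not part of Spark or Hail), and the helper assumes binary units (1g = 1024^3 bytes). It presumes, as the answer suggests, that this Spark version only accepts a plain integer for these two settings.

```python
# Hypothetical helpers, not part of Spark or Hail.
# Assumption: binary units, i.e. 1g = 1024**3 bytes.
_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3, "t": 1024 ** 4}


def to_bytes(size):
    """Expand a size such as '60g' into an integer number of bytes."""
    size = str(size).strip().lower()
    if size and size[-1] in _UNITS:
        return int(size[:-1]) * _UNITS[size[-1]]
    # No unit suffix: treat the value as a byte count already.
    return int(size)


def conf_args(size):
    """Build the --conf arguments for both settings named in the error."""
    n = to_bytes(size)
    keys = (
        "spark.sql.files.openCostInBytes",
        "spark.sql.files.maxPartitionBytes",
    )
    args = []
    for key in keys:
        args += ["--conf", "{}={}".format(key, n)]
    return args


if __name__ == "__main__":
    # Prints a pyspark command line with plain integer byte counts.
    print("pyspark " + " ".join(conf_args("60g")))
```

Either value clears the "at least 50G" check from the error message: 60g under binary units is 64424509440 bytes, and the answer's 60000000000 is also larger than 50 * 1024^3.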

