I am trying to test ts-flint (https://github.com/twosigma/flint) in PySpark. The library depends on flint-0.6.0.jar, which I downloaded from the Maven site and put into HDFS. But I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~/.conda/envs/py3/lib/python3.7/site-packages/ts/flint/java.py in new_reader(self)
37 try:
---> 38 return utils.jvm(self.sc).com.twosigma.flint.timeseries.io.read.TSReadBuilder()
39 except TypeError:
TypeError: 'JavaPackage' object is not callable
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-66-87fea234d827> in <module>
----> 1 df = fc.read.options(timeColumn="ds").parquet(TB.SALECOUNT_OUT, is_sorted=False)
~/.conda/envs/py3/lib/python3.7/site-packages/ts/flint/context.py in read(self)
82 '''
83
---> 84 return readwriter.TSDataFrameReader(self)
~/.conda/envs/py3/lib/python3.7/site-packages/ts/flint/readwriter.py in __init__(self, flintContext)
49 self._sqlContext = self._flintContext._sqlContext
50 self._jpkg = java.Packages(self._sc)
---> 51 self._reader = self._jpkg.new_reader()
52 self._parameters = self._reader.parameters()
53
~/.conda/envs/py3/lib/python3.7/site-packages/ts/flint/java.py in new_reader(self)
38 return utils.jvm(self.sc).com.twosigma.flint.timeseries.io.read.TSReadBuilder()
39 except TypeError:
---> 40 return utils.jvm(self.sc).com.twosigma.flint.timeseries.io.read.ReadBuilder()
41
42 @property
TypeError: 'JavaPackage' object is not callable
There is an issue about this problem, but it has no solution: https://github.com/twosigma/flint/issues/77
I have never included a jar before, and I am not familiar with Java, so I don't know what is wrong.
There is no public way to list the loaded jars in PySpark (only in Scala; a py4j-based workaround is sketched after the question), so all I can provide is the Spark configuration from spark.sparkContext.getConf().getAll():
[('spark.eventLog.enabled', 'true'),
('spark.yarn.appMasterEnv.MKL_NUM_THREADS', '1'),
('spark.driver.extraClassPath', 'hdfs://maskxdc/test/flint-0.6.0.jar'),
('spark.sql.queryExecutionListeners',
'com.cloudera.spark.lineage.NavigatorQueryListener'),
('spark.driver.appUIAddress', 'http://dc07:4040'),
('spark.yarn.am.extraLibraryPath',
'/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hadoop/lib/native'),
('spark.driver.memory', '5g'),
('spark.lineage.log.dir', '/var/log/spark/lineage'),
('spark.serializer', 'org.apache.spark.serializer.KryoSerializer'),
('spark.executorEnv.PYTHONPATH',
'/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/python/lib/py4j-0.10.7-src.zip<CPS>/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/python/lib/pyspark.zip'),
('spark.driver.host', 'dc07'),
('spark.yarn.dist.files', ''),
('spark.executor.cores', '3'),
('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.RM_HA_URLS',
'dc05:8088,dc34:8088'),
('spark.ui.filters',
'org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter'),
('spark.executor.extraClassPath', 'hdfs://maskxdc/test/flint-0.6.0.jar'),
('spark.network.crypto.enabled', 'false'),
('spark.executorEnv.MKL_NUM_THREADS', '1'),
('spark.executor.memory', '4g'),
('spark.ui.enabled', 'true'),
('spark.app.name', 'test_report'),
('spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON', '/opt/conda/bin/python'),
('spark.executor.id', 'driver'),
('spark.dynamicAllocation.schedulerBacklogTimeout', '1'),
('spark.yarn.keytab', '/etc/security/hdfs.keytab'),
('spark.yarn.config.gatewayPath', '/opt/cloudera/parcels'),
('spark.extraListeners', 'com.cloudera.spark.lineage.NavigatorAppListener'),
('spark.executor.memoryOverhead', '2g'),
('spark.yarn.jars',
'local:/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/jars/*,local:/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/hive/*'),
('spark.sql.warehouse.dir', '/user/hive/warehouse'),
('spark.yarn.appMasterEnv.PYSPARK_PYTHON', '/opt/conda/bin/python'),
('spark.driver.log.persistToDfs.enabled', 'true'),
('spark.jars.packages', 'hdfs://maskxdc/test/flint-0.6.0.jar'),
('spark.yarn.config.replacementPath', '{{HADOOP_COMMON_HOME}}/../../..'),
('spark.executorEnv.OPENBLAS_NUM_THREADS', '1'),
('spark.yarn.executorEnv.PYSPARK_DRIVER_PYTHON', '/opt/conda/bin/python'),
('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES',
'http://dc05:8088/proxy/application_1574228860079_0369,http://dc34:8088/proxy/application_1574228860079_0369'),
('spark.ui.proxyBase', '/proxy/application_1574228860079_0366'),
('spark.port.maxRetries', '100'),
('spark.dynamicAllocation.enabled', 'false'),
('spark.files', 'hdfs://maskxdc/test/flint-0.6.0.jar'),
('spark.executor.extraLibraryPath',
'/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hadoop/lib/native'),
('spark.app.id', 'application_1574228860079_0369'),
('spark.sql.session.timeZone', 'Asia/Shanghai'),
('spark.yarn.principal', 'hdfs@maskOFFICE.INTERNAL'),
('spark.ui.killEnabled', 'true'),
('spark.dynamicAllocation.executorIdleTimeout', '60'),
('spark.io.encryption.enabled', 'false'),
('spark.authenticate', 'false'),
('spark.serializer.objectStreamReset', '100'),
('spark.submit.deployMode', 'client'),
('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS',
'dc05,dc34'),
('spark.driver.port', '37057'),
('spark.shuffle.service.enabled', 'true'),
('spark.yarn.historyServer.allowTracking', 'true'),
('spark.driver.extraLibraryPath',
'/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hadoop/lib/native'),
('spark.yarn.appMasterEnv.OPENBLAS_NUM_THREADS', '1'),
('spark.shuffle.service.port', '7337'),
('spark.driver.maxResultSize', '5g'),
('spark.lineage.enabled', 'true'),
('spark.yarn.historyServer.address', 'http://dc09:18088'),
('spark.master', 'yarn'),
('spark.executor.instances', '5'),
('spark.rdd.compress', 'True'),
('spark.kryoserializer.buffer.max', '1024'),
('spark.dynamicAllocation.minExecutors', '0'),
('spark.yarn.isPython', 'true'),
('spark.eventLog.dir', 'hdfs://maskxdc/user/spark/applicationHistory'),
('spark.ui.showConsoleProgress', 'true'),
('spark.yarn.executorEnv.PYSPARK_PYTHON', '/opt/conda/bin/python'),
('spark.driver.log.dfsDir', '/user/spark/driverLogs')]
The jar-related settings are:
('spark.driver.extraClassPath', 'hdfs://maskxdc/test/flint-0.6.0.jar'),
('spark.jars', 'hdfs://maskxdc/test/flint-0.6.0.jar'),
('spark.executor.extraClassPath', 'hdfs://maskxdc/test/flint-0.6.0.jar'),
('spark.jars.packages', 'hdfs://maskxdc/test/flint-0.6.0.jar'),
('spark.files', 'hdfs://maskxdc/test/flint-0.6.0.jar'),
This looks fine to me; I have added flint-0.6.0.jar everywhere I could think of.
How can I fix this?
It does work in the shell when launched as pyspark --jars hdfs://maskxdc/test/flint-0.6.0.jar.
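
As an aside, the jars the running JVM actually loaded can be checked from PySpark through the py4j gateway, even though there is no public Python API for it. This sketch relies on Spark internals (the private _jsc attribute reaching the Scala SparkContext.listJars() method), so treat it as a debugging hack only:

# Reach the underlying Scala SparkContext through the py4j gateway.
# _jsc is a private attribute, so this depends on Spark internals;
# it is useful only to see which jars the JVM actually loaded.
scala_sc = spark.sparkContext._jsc.sc()
print(scala_sc.listJars().mkString("\n"))  # Scala Seq of jar URLs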
Answer 0 (score: 0)
OK, the problem seems to be caused mainly by the cached Jupyter/Spark session: jar settings are only read when the JVM starts, so changing them on an already-running session has no effect. Restarting the Jupyter notebook and creating a brand-new Spark session with {"spark.jars": "hdfs://maskxdc/test/flint-0.6.0.jar"} solved the problem. Only spark.jars actually needs to be configured.
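
For completeness, a minimal sketch of the fix. The FlintContext import and the SQLContext wrapper follow the ts-flint README; the builder chain is an assumption matching the configuration dumped above:

from pyspark.sql import SparkSession, SQLContext
from ts.flint import FlintContext

# Jar options are read at JVM startup, so spark.jars must be set
# before the first SparkSession of the process is created.
spark = (
    SparkSession.builder
    .master("yarn")
    .appName("test_report")
    .config("spark.jars", "hdfs://maskxdc/test/flint-0.6.0.jar")
    .getOrCreate()
)

# ts-flint's FlintContext wraps a SQLContext (per its README).
fc = FlintContext(SQLContext(spark.sparkContext, spark))

# With the jar on the JVM classpath, TSReadBuilder resolves and reads work:
# df = fc.read.options(timeColumn="ds").parquet(path, is_sorted=False)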