I am trying to test ts-flint (https://github.com/twosigma/flint) in PySpark. The library depends on flint-0.6.0.jar, which I downloaded from the Maven site and put into HDFS. But I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~/.conda/envs/py3/lib/python3.7/site-packages/ts/flint/java.py in new_reader(self)
37 try:
---> 38 return utils.jvm(self.sc).com.twosigma.flint.timeseries.io.read.TSReadBuilder()
39 except TypeError:
TypeError: 'JavaPackage' object is not callable
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-66-87fea234d827> in <module>
----> 1 df = fc.read.options(timeColumn="ds").parquet(TB.SALECOUNT_OUT, is_sorted=False)
~/.conda/envs/py3/lib/python3.7/site-packages/ts/flint/context.py in read(self)
82 '''
83
---> 84 return readwriter.TSDataFrameReader(self)
~/.conda/envs/py3/lib/python3.7/site-packages/ts/flint/readwriter.py in __init__(self, flintContext)
49 self._sqlContext = self._flintContext._sqlContext
50 self._jpkg = java.Packages(self._sc)
---> 51 self._reader = self._jpkg.new_reader()
52 self._parameters = self._reader.parameters()
53
~/.conda/envs/py3/lib/python3.7/site-packages/ts/flint/java.py in new_reader(self)
38 return utils.jvm(self.sc).com.twosigma.flint.timeseries.io.read.TSReadBuilder()
39 except TypeError:
---> 40 return utils.jvm(self.sc).com.twosigma.flint.timeseries.io.read.ReadBuilder()
41
42 @property
TypeError: 'JavaPackage' object is not callable
There is an issue about this problem, but it has no solution: https://github.com/twosigma/flint/issues/77
I have never included a jar before, and I am not familiar with Java, so I don't know what is wrong.
There is no public way to list the loaded jars in PySpark (only in Scala; a py4j-based workaround is sketched after the question), so all I can provide is the Spark configuration from spark.sparkContext.getConf().getAll():
[('spark.eventLog.enabled', 'true'),
('spark.yarn.appMasterEnv.MKL_NUM_THREADS', '1'),
('spark.driver.extraClassPath', 'hdfs://maskxdc/test/flint-0.6.0.jar'),
('spark.sql.queryExecutionListeners',
'com.cloudera.spark.lineage.NavigatorQueryListener'),
('spark.driver.appUIAddress', 'http://dc07:4040'),
('spark.yarn.am.extraLibraryPath',
'/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hadoop/lib/native'),
('spark.driver.memory', '5g'),
('spark.lineage.log.dir', '/var/log/spark/lineage'),
('spark.serializer', 'org.apache.spark.serializer.KryoSerializer'),
('spark.executorEnv.PYTHONPATH',
'/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/python/lib/py4j-0.10.7-src.zip<CPS>/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/python/lib/pyspark.zip'),
('spark.driver.host', 'dc07'),
('spark.yarn.dist.files', ''),
('spark.executor.cores', '3'),
('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.RM_HA_URLS',
'dc05:8088,dc34:8088'),
('spark.ui.filters',
'org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter'),
('spark.executor.extraClassPath', 'hdfs://maskxdc/test/flint-0.6.0.jar'),
('spark.network.crypto.enabled', 'false'),
('spark.executorEnv.MKL_NUM_THREADS', '1'),
('spark.executor.memory', '4g'),
('spark.ui.enabled', 'true'),
('spark.app.name', 'test_report'),
('spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON', '/opt/conda/bin/python'),
('spark.executor.id', 'driver'),
('spark.dynamicAllocation.schedulerBacklogTimeout', '1'),
('spark.yarn.keytab', '/etc/security/hdfs.keytab'),
('spark.yarn.config.gatewayPath', '/opt/cloudera/parcels'),
('spark.extraListeners', 'com.cloudera.spark.lineage.NavigatorAppListener'),
('spark.executor.memoryOverhead', '2g'),
('spark.yarn.jars',
'local:/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/jars/*,local:/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/hive/*'),
('spark.sql.warehouse.dir', '/user/hive/warehouse'),
('spark.yarn.appMasterEnv.PYSPARK_PYTHON', '/opt/conda/bin/python'),
('spark.driver.log.persistToDfs.enabled', 'true'),
('spark.jars.packages', 'hdfs://maskxdc/test/flint-0.6.0.jar'),
('spark.yarn.config.replacementPath', '{{HADOOP_COMMON_HOME}}/../../..'),
('spark.executorEnv.OPENBLAS_NUM_THREADS', '1'),
('spark.yarn.executorEnv.PYSPARK_DRIVER_PYTHON', '/opt/conda/bin/python'),
('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES',
'http://dc05:8088/proxy/application_1574228860079_0369,http://dc34:8088/proxy/application_1574228860079_0369'),
('spark.ui.proxyBase', '/proxy/application_1574228860079_0366'),
('spark.port.maxRetries', '100'),
('spark.dynamicAllocation.enabled', 'false'),
('spark.files', 'hdfs://maskxdc/test/flint-0.6.0.jar'),
('spark.executor.extraLibraryPath',
'/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hadoop/lib/native'),
('spark.app.id', 'application_1574228860079_0369'),
('spark.sql.session.timeZone', 'Asia/Shanghai'),
('spark.yarn.principal', 'hdfs@maskOFFICE.INTERNAL'),
('spark.ui.killEnabled', 'true'),
('spark.dynamicAllocation.executorIdleTimeout', '60'),
('spark.io.encryption.enabled', 'false'),
('spark.authenticate', 'false'),
('spark.serializer.objectStreamReset', '100'),
('spark.submit.deployMode', 'client'),
('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS',
'dc05,dc34'),
('spark.driver.port', '37057'),
('spark.shuffle.service.enabled', 'true'),
('spark.yarn.historyServer.allowTracking', 'true'),
('spark.driver.extraLibraryPath',
'/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hadoop/lib/native'),
('spark.yarn.appMasterEnv.OPENBLAS_NUM_THREADS', '1'),
('spark.shuffle.service.port', '7337'),
('spark.driver.maxResultSize', '5g'),
('spark.lineage.enabled', 'true'),
('spark.yarn.historyServer.address', 'http://dc09:18088'),
('spark.master', 'yarn'),
('spark.executor.instances', '5'),
('spark.rdd.compress', 'True'),
('spark.kryoserializer.buffer.max', '1024'),
('spark.dynamicAllocation.minExecutors', '0'),
('spark.yarn.isPython', 'true'),
('spark.eventLog.dir', 'hdfs://maskxdc/user/spark/applicationHistory'),
('spark.ui.showConsoleProgress', 'true'),
('spark.yarn.executorEnv.PYSPARK_PYTHON', '/opt/conda/bin/python'),
('spark.driver.log.dfsDir', '/user/spark/driverLogs')]
The jar-related settings are:
('spark.driver.extraClassPath', 'hdfs://maskxdc/test/flint-0.6.0.jar'),
('spark.jars', 'hdfs://maskxdc/test/flint-0.6.0.jar'),
('spark.executor.extraClassPath', 'hdfs://maskxdc/test/flint-0.6.0.jar'),
('spark.jars.packages', 'hdfs://maskxdc/test/flint-0.6.0.jar'),
('spark.files', 'hdfs://maskxdc/test/flint-0.6.0.jar'),
This looks fine to me; I have added flint-0.6.0.jar everywhere I could think of.
How can I fix this?
It does work in the shell when launched as pyspark --jars hdfs://maskxdc/test/flint-0.6.0.jar.
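
As an aside, the jars the running JVM actually loaded can be checked from PySpark through the py4j gateway, even though there is no public Python API for it. This sketch relies on Spark internals (the private _jsc attribute reaching the Scala SparkContext.listJars() method), so treat it as a debugging hack only:

# Reach the underlying Scala SparkContext through the py4j gateway.
# _jsc is a private attribute, so this depends on Spark internals;
# it is useful only to see which jars the JVM actually loaded.
scala_sc = spark.sparkContext._jsc.sc()
print(scala_sc.listJars().mkString("\n"))  # Scala Seq of jar URLs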
Answer 0 (score: 0)
OK, the problem seems to be caused mainly by the cached Jupyter/Spark session: jar settings are only read when the JVM starts, so changing them on an already-running session has no effect. Restarting the Jupyter notebook and creating a brand-new Spark session with {"spark.jars": "hdfs://maskxdc/test/flint-0.6.0.jar"} solved the problem. Only spark.jars actually needs to be configured.
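
For completeness, a minimal sketch of the fix. The FlintContext import and the SQLContext wrapper follow the ts-flint README; the builder chain is an assumption matching the configuration dumped above:

from pyspark.sql import SparkSession, SQLContext
from ts.flint import FlintContext

# Jar options are read at JVM startup, so spark.jars must be set
# before the first SparkSession of the process is created.
spark = (
    SparkSession.builder
    .master("yarn")
    .appName("test_report")
    .config("spark.jars", "hdfs://maskxdc/test/flint-0.6.0.jar")
    .getOrCreate()
)

# ts-flint's FlintContext wraps a SQLContext (per its README).
fc = FlintContext(SQLContext(spark.sparkContext, spark))

# With the jar on the JVM classpath, TSReadBuilder resolves and reads work:
# df = fc.read.options(timeColumn="ds").parquet(path, is_sorted=False)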