I've been playing around with Spark, but I can't work out how to structure this execution flow. Pseudocode below:
from pyspark import SparkConf, SparkContext, SQLContext
sc = SparkContext(conf=conf)
sqlSC = SQLContext(sc)
df1 = getBigDataSetFromDb()
ddf1 = sqlSC.createDataFrame(sc.broadcast(df1))
df2 = getOtherBigDataSetFromDb()
ddf2 = sqlSC.createDataFrame(sc.broadcast(df2))
datesList = sc.parallelize(aListOfDates)
def myComplicatedFunc(cobDate):
    filteredDF1 = ddf1.filter(ddf1['BusinessDate'] == cobDate)
    filteredDF2 = ddf2.filter(ddf2['BusinessDate'] == cobDate)
    # some more complicated stuff that uses filteredDF1 & filteredDF2
    return someValue
results = datesList.map(myComplicatedFunc)
However, what I get is this:
Traceback (most recent call last):
File "/net/nas/SysGrid_Users/John.Richardson/Code/HistoricVars/sparkTest2.py", line 76, in <module>
varResults = distDates.map(varFunc).collect()
File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/rdd.py", line 771, in collect
File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/rdd.py", line 2379, in _jrdd
File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/rdd.py", line 2299, in _prepare_for_python_RDD
File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 428, in dumps
File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 646, in dumps
File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 107, in dump
File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 408, in dump
self.save(obj)
File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 740, in save_tuple
save(element)
File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 199, in save_function
File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 236, in save_function_tuple
File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 725, in save_tuple
save(element)
File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 770, in save_list
self._batch_appends(obj)
File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 797, in _batch_appends
save(tmp[0])
File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 193, in save_function
File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 241, in save_function_tuple
File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 810, in save_dict
self._batch_setitems(obj.items())
File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 841, in _batch_setitems
save(v)
File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 520, in save
self.save_reduce(obj=obj, *rv)
File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 542, in save_reduce
File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 810, in save_dict
self._batch_setitems(obj.items())
File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 836, in _batch_setitems
save(v)
File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 495, in save
rv = reduce(self.proto)
File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o44.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
I suspect I'm going about this the wrong way. I thought the point of broadcast variables was that they could be used inside closures. But perhaps I have to do some kind of join instead?
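For context, the join-style alternative I have in mind would look roughly like this (a sketch only; the count() is just a stand-in for my real per-date computation):

joined = ddf1.join(ddf2, ddf1['BusinessDate'] == ddf2['BusinessDate'])
# aggregate per date with DataFrame operations instead of filtering inside a closure
results = joined.groupBy(ddf1['BusinessDate']).count().collect()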
Answer 0 (score: 0)
While I agree with the comments about the lack of domain context, I don't think this is what you want:
df2 = getOtherBigDataSetFromDb()
ddf2 = sqlSC.createDataFrame(sc.broadcast(df2))
You don't say what type df2 is, but let's assume it is actually an array rather than a DataFrame (despite the df* naming). If it is an array, what you probably want is:
df2 = getOtherBigDataSetFromDb()
ddf2 = sqlSC.createDataFrame(sc.parallelize(df2))  # distribute the local collection as an RDD first
That said, getOtherBigDataSetFromDb implies that it really is a big data set. So while this flow will work, if your data set is truly very large you may want to consume it in chunks. You could write that yourself, or there may already be a library that reads from your database of choice. In any case, I think you meant parallelize rather than broadcast.
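For what it's worth, here is a minimal sketch of the chunked approach using Spark's built-in JDBC reader (the URL, table name, split column, bounds and credentials below are all placeholders; substitute your own):

# All connection details here are hypothetical; adjust for your database.
ddf2 = sqlSC.read.jdbc(
    url='jdbc:postgresql://dbhost:5432/mydb',
    table='other_big_table',
    column='id',            # numeric column Spark uses to split the read
    lowerBound=0,
    upperBound=1000000,
    numPartitions=16,
    properties={'user': 'username', 'password': 'password'})

This reads the table in 16 parallel partitions and yields a DataFrame directly, so neither broadcast nor parallelize is needed for the load step.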