How to parallelize functions with PySpark

Asked: 2018-09-25 18:19:03

Tags: python apache-spark pyspark apache-spark-sql

How can I use PySpark to parallelize a function that runs over different filters of a DataFrame?

For example, on this DataFrame I want to save the second position for each country, that is, the rows ('us',2,34) and ('ca',3,98):

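# Imports assumed by the snippets in this post: `sf` and `col` are used but never imported
from pyspark.sql import functions as sf
from pyspark.sql.functions import col

df = sqlContext.createDataFrame(
    [
        ('us', 1, 55),
        ('us', 2, 34),
        ('us', 3, 56),
        ('ca', 3, 98),
        ('ca', 4, 65),
        ('ca', 1, 78)
    ],
    ["country", "position", "value"])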

df.createOrReplaceTempView("df")

The following code raises an error: TypeError: 'JavaPackage' object is not callable

countries_list = df.select(sf.collect_set('country').alias('country')).first()['country']

countries = sc.parallelize(countries_list)

# Process filtered country in df 
def process_country(df):
    position_lst = sorted(df.select(sf.collect_set('position').alias('position')).first()['position'])

    for i in position_lst:
        for j in position_lst:
            if j>i:
                row_to_save = df.filter(col('position').isin([j]))
                row_to_save.write.save('hdfs:///user/folder', format='parquet', mode='overwrite')
                break
        break

# filter main dataframe with country = x
def process(x): 
    df = sqlContext.table("df")
    df = df.filter(col('country').isin([x]) )
    process_country(df)


countries.foreach(process) 

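I currently do this with a plain Python loop, which processes the countries one after another and therefore takes too long. The goal is to process all countries in parallel.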

Full error:

Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/cloudpickle.py", line 235, in dump
    return Pickler.dump(self, obj)
  File "/usr/lib64/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib64/python2.7/pickle.py", line 568, in save_tuple
    save(element)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/spark/python/pyspark/cloudpickle.py", line 378, in save_function
    self.save_function_tuple(obj)
  File "/usr/lib/spark/python/pyspark/cloudpickle.py", line 529, in save_function_tuple
    save(closure_values)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib64/python2.7/pickle.py", line 606, in save_list
    self._batch_appends(iter(obj))
  File "/usr/lib64/python2.7/pickle.py", line 639, in _batch_appends
    save(x)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/spark/python/pyspark/cloudpickle.py", line 378, in save_function
    self.save_function_tuple(obj)
  File "/usr/lib/spark/python/pyspark/cloudpickle.py", line 529, in save_function_tuple
    save(closure_values)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib64/python2.7/pickle.py", line 606, in save_list
    self._batch_appends(iter(obj))
  File "/usr/lib64/python2.7/pickle.py", line 639, in _batch_appends
    save(x)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/spark/python/pyspark/cloudpickle.py", line 378, in save_function
    self.save_function_tuple(obj)
  File "/usr/lib/spark/python/pyspark/cloudpickle.py", line 529, in save_function_tuple
    save(closure_values)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib64/python2.7/pickle.py", line 606, in save_list
    self._batch_appends(iter(obj))
  File "/usr/lib64/python2.7/pickle.py", line 639, in _batch_appends
    save(x)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/spark/python/pyspark/cloudpickle.py", line 378, in save_function
    self.save_function_tuple(obj)
  File "/usr/lib/spark/python/pyspark/cloudpickle.py", line 529, in save_function_tuple
    save(closure_values)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib64/python2.7/pickle.py", line 606, in save_list
    self._batch_appends(iter(obj))
  File "/usr/lib64/python2.7/pickle.py", line 642, in _batch_appends
    save(tmp[0])
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/spark/python/pyspark/cloudpickle.py", line 378, in save_function
    self.save_function_tuple(obj)
  File "/usr/lib/spark/python/pyspark/cloudpickle.py", line 529, in save_function_tuple
    save(closure_values)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib64/python2.7/pickle.py", line 606, in save_list
    self._batch_appends(iter(obj))
  File "/usr/lib64/python2.7/pickle.py", line 642, in _batch_appends
    save(tmp[0])
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/spark/python/pyspark/cloudpickle.py", line 378, in save_function
    self.save_function_tuple(obj)
  File "/usr/lib/spark/python/pyspark/cloudpickle.py", line 529, in save_function_tuple
    save(closure_values)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib64/python2.7/pickle.py", line 606, in save_list
    self._batch_appends(iter(obj))
  File "/usr/lib64/python2.7/pickle.py", line 642, in _batch_appends
    save(tmp[0])
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/spark/python/pyspark/cloudpickle.py", line 372, in save_function
    self.save_function_tuple(obj)
  File "/usr/lib/spark/python/pyspark/cloudpickle.py", line 525, in save_function_tuple
    save(f_globals)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib64/python2.7/pickle.py", line 655, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/usr/lib64/python2.7/pickle.py", line 687, in _batch_setitems
    save(v)
  File "/usr/lib64/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/usr/lib/spark/python/pyspark/cloudpickle.py", line 804, in save_reduce
    save(state)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib64/python2.7/pickle.py", line 655, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/usr/lib64/python2.7/pickle.py", line 687, in _batch_setitems
    save(v)
  File "/usr/lib64/python2.7/pickle.py", line 306, in save
    rv = reduce(self.proto)
TypeError: 'JavaPackage' object is not callable

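Note: this is a simplified example; the real function is quite different and more complex. The main question is how to parallelize any kind of function that has to be applied to different subsets of a DataFrame (avoiding serial execution).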
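The traceback ends inside cloudpickle while it serializes the function passed to `foreach`. This suggests the failure comes from shipping `process`, and the driver-only `sqlContext` it references, to the executors. For an arbitrary per-subset function, one commonly used alternative is to keep the function on the driver and submit the per-country jobs from several threads, since Spark's scheduler accepts jobs from multiple driver threads. The sketch below shows only the threading pattern. `run_per_key`, `second_position` and `ROWS` are hypothetical names, and `second_position` is a pure-Python stand-in so the example runs without a cluster. In the question's setting it would instead filter the DataFrame and write the result.

```python
from concurrent.futures import ThreadPoolExecutor

# Same sample data as the post, kept as plain tuples for the stand-in job.
ROWS = [
    ("us", 1, 55), ("us", 2, 34), ("us", 3, 56),
    ("ca", 3, 98), ("ca", 4, 65), ("ca", 1, 78),
]


def second_position(country, rows=ROWS):
    """Stand-in for a per-country Spark job (hypothetical).

    With Spark this body would run on the driver, e.g. filter the
    DataFrame by country and trigger a write action.
    """
    positions = sorted({p for c, p, _ in rows if c == country})
    if len(positions) < 2:
        return None  # no second position for this country
    target = positions[1]
    return next(r for r in rows if r[0] == country and r[1] == target)


def run_per_key(keys, job, max_workers=4):
    """Run job(key) for every key on a driver-side thread pool."""
    keys = list(keys)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as keys.
        return dict(zip(keys, pool.map(job, keys)))


countries = sorted({c for c, _, _ in ROWS})
results = run_per_key(countries, second_position)
print(results)  # {'ca': ('ca', 3, 98), 'us': ('us', 2, 34)}
```

Threads are sufficient here because each call spends its time waiting on Spark rather than computing in Python. How many jobs actually overlap depends on the executors available and on the scheduler mode.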

0 answers:

No answers