Importing the bitarray library into SparkContext

Asked: 2016-01-23 03:42:25

Tags: python-2.7 apache-spark pyspark

I am trying to import the bitarray library into a SparkContext. https://pypi.python.org/pypi/bitarray/0.8.1

To do this I have zipped the contents of the bitarray folder and then tried to add it to my Python path. However, even after pushing the library to the nodes, my RDD cannot find the library. Here is my code:

zip bitarray.zip bitarray-0.8.1/bitarray/*

# Check the contents of the zip file

unzip -l bitarray.zip
Archive:  bitarray.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
   143455  2015-11-06 02:07   bitarray/_bitarray.so
     4440  2015-11-06 02:06   bitarray/__init__.py
     6224  2015-11-06 02:07   bitarray/__init__.pyc
    68516  2015-11-06 02:06   bitarray/test_bitarray.py
    78976  2015-11-06 02:07   bitarray/test_bitarray.pyc
---------                     -------
   301611                     5 files
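One detail worth noting in this listing: `bitarray/_bitarray.so` is a compiled C extension. Python's zipimport mechanism can load pure-Python modules (`.py`/`.pyc`) from a zip placed on `sys.path`, but it cannot load shared objects. A minimal stdlib sketch of the part that does work (the `mymod` package and zip name are made up for illustration):

```python
import os
import sys
import tempfile
import zipfile

# Build a zip containing a pure-Python package (hypothetical name "mymod").
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, "mymod.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("mymod/__init__.py", "VALUE = 42\n")

# Adding the zip to sys.path lets zipimport find the pure-Python package.
sys.path.insert(0, zip_path)
import mymod

print(mymod.VALUE)  # → 42
```

A compiled module like `_bitarray.so` cannot be imported from inside a zip this way, which is one reason shipping bitarray as a plain zip can fail on the executors.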

Then in Spark:

import os
import sys

# Environment
import findspark
findspark.init("/home/utils/spark-1.6.0/")

import pyspark
sparkConf = pyspark.SparkConf()

sparkConf.set("spark.executor.instances", "2") 
sparkConf.set("spark.executor.memory", "10g")
sparkConf.set("spark.executor.cores", "2")

sc = pyspark.SparkContext(conf = sparkConf)

from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import HiveContext
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import udf

hiveContext = HiveContext(sc)

PYBLOOM_LIB = '/home/ryandevera/pybloom.zip'
sys.path.append(PYBLOOM_LIB)
sc.addPyFile(PYBLOOM_LIB)

from pybloom import BloomFilter
f = BloomFilter(capacity=1000, error_rate=0.001)
x = sc.parallelize([(1,("hello",4)),(2,("goodbye",5)),(3,("hey",6)),(4,("test",7))],2)


def bloom_filter_spark(iterator):
    for id,_ in iterator:
        f.add(id)
    yield (None, f)

x.mapPartitions(bloom_filter_spark).take(1)

This produces the error -

ImportError: pybloom requires bitarray >= 0.3.4

I am not sure where I am going wrong. Any help would be appreciated!

1 answer:

Answer 0 (score: 2)

Your simplest approach is to create and distribute egg files. Assuming you have downloaded and unpacked the source archives from PyPI and set the PYBLOOM_SOURCE_DIR and BITARRAY_SOURCE_DIR variables:

cd $PYBLOOM_SOURCE_DIR
python setup.py bdist_egg
cd $BITARRAY_SOURCE_DIR
python setup.py bdist_egg

Then in PySpark add:

from itertools import chain
import os
import glob

eggs = chain.from_iterable([
    glob.glob(os.path.join(os.environ[x], "dist/*")) for x in
    ["PYBLOOM_SOURCE_DIR", "BITARRAY_SOURCE_DIR"]
])

for egg in eggs:
    sc.addPyFile(egg)
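Why shipping the code alone may still not be enough: Spark pickles task closures, and objects that wrap C-level state are often not picklable. A stand-in illustration using a plain `threading.Lock` (the `Filter` class below is hypothetical, not pybloom's):

```python
import pickle
import threading

class Filter:
    """Hypothetical stand-in for an object holding unpicklable C state."""
    def __init__(self):
        self.state = threading.Lock()  # locks cannot be pickled

f = Filter()
try:
    pickle.dumps(f)
except TypeError as exc:
    print("not picklable:", exc)
```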

The problem is that the BloomFilter object cannot be serialized correctly, so if you want to use it you will have to either patch it or extract the bitarrays and pass those instead:

def buildFilter(iter):
    bf = BloomFilter(capacity=1000, error_rate=0.001)
    for x in iter:
        bf.add(x)
    return [bf.bitarray]

rdd = sc.parallelize(range(100))
rdd.mapPartitions(buildFilter).reduce(lambda x, y: x | y)
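The per-partition build-then-merge pattern above can be sketched without a cluster; the version below simulates partitions with plain lists and uses Python sets in place of Bloom filters (purely illustrative, no Spark or pybloom required):

```python
# Simulate an RDD of keys split into two partitions.
data = list(range(100))
partitions = [data[:50], data[50:]]

def build_filter(iterator):
    acc = set()  # per-partition accumulator (stands in for a BloomFilter)
    for x in iterator:
        acc.add(x)
    yield acc

# mapPartitions yields one accumulator per partition ...
partials = [next(build_filter(p)) for p in partitions]
# ... and reduce merges them, like OR-ing the extracted bitarrays above.
merged = partials[0] | partials[1]
print(len(merged))  # → 100
```

With real Bloom filters, the `|` over the extracted bitarrays plays the same role as the set union here.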