On worker nodes

Time: 2015-06-25 00:12:56

Tags: python numpy apache-spark pyspark

I am running Spark 1.3 in standalone mode in a Cloudera environment. I can run PySpark from an IPython notebook, but as soon as I add a second worker node my code stops running and returns an error. I am fairly sure this is because the modules on my master are not visible to the worker nodes. I tried importing numpy and it did not work, even though numpy is installed on the workers through Anaconda. I installed Anaconda on the master and on the workers in the same way.

Following Josh Rosen's suggestion, I made sure that the libraries are installed on the worker nodes:

https://groups.google.com/forum/#!topic/spark-users/We_F8vlxvq0

However, I still seem to be running into problems, including the fact that my workers do not recognize the command abs, which is standard in Python 2.6.

The code I am running comes from this post:

https://districtdatalabs.silvrback.com/getting-started-with-spark-in-python

def isprime(n):
    """
    check if integer n is a prime
    """
    # make sure n is a positive integer
    n = abs(int(n))
    # 0 and 1 are not primes
    if n < 2:
        return False
    # 2 is the only even prime number
    if n == 2:
        return True
    # all other even numbers are not primes
    if not n & 1:
        return False
    # range starts with 3 and only needs to go up to the square root of n
    # for all odd numbers
    for x in range(3, int(n**0.5)+1, 2):
        if n % x == 0:
            return False
    return True

# Create an RDD of numbers from 0 to 1,000,000
nums = sc.parallelize(xrange(1000000))

# Compute the number of primes in the RDD
print nums.filter(isprime).count()
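
As a smaller sanity check (just a sketch using the sc already defined in the notebook; the numpy version lookup is only there to confirm that the workers can actually import the module), something like this should show whether the worker nodes see numpy at all:

# Run from the same notebook, so sc already exists.
import numpy

# Every task imports numpy on its worker and reports the version it finds;
# if a worker cannot import numpy, the tasks fail with an ImportError.
print sc.parallelize(xrange(4)).map(lambda _: numpy.__version__).distinct().collect()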

1 Answer:

Answer 0 (score: 8):

I often use the Anaconda distribution with PySpark as well and find it useful to set the PYSPARK_PYTHON variable, pointing it to the python binary within the Anaconda distribution. I've found that otherwise I get lots of strange errors. You can check which python is being used by running rdd.map(lambda x: sys.executable).distinct().collect(). I suspect it's not pointing to the correct location.

In any case, I recommend wrapping the configuration of your path and environment variables in a script. I use the following:

import os
import sys

def configure_spark(spark_home=None, pyspark_python=None):
    spark_home = spark_home or "/path/to/default/spark/home"
    os.environ['SPARK_HOME'] = spark_home

    # Add the PySpark directories to the Python path:
    sys.path.insert(1, os.path.join(spark_home, 'python'))
    sys.path.insert(1, os.path.join(spark_home, 'python', 'pyspark'))
    sys.path.insert(1, os.path.join(spark_home, 'python', 'build'))

    # If PySpark isn't specified, use currently running Python binary:
    pyspark_python = pyspark_python or sys.executable
    os.environ['PYSPARK_PYTHON'] = pyspark_python

When you point to your Anaconda binary, you should also be able to import all the packages installed in its site-packages directory. This technique should work for conda environments as well.
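
A full script that uses this helper might look roughly like the sketch below (a continuation of the snippet above, so configure_spark and the imports are already in scope; the two paths and the app name are placeholders for your own layout):

# Both paths are placeholders; point them at your own Spark and Anaconda installs.
configure_spark(spark_home="/opt/spark",
                pyspark_python="/opt/anaconda/bin/python")

# Import pyspark only after SPARK_HOME and sys.path have been set up.
from pyspark import SparkContext
sc = SparkContext("local[2]", "anaconda-check")

# Each task reports the interpreter it runs under; with PYSPARK_PYTHON set
# correctly this should print only the Anaconda binary.
print sc.parallelize(xrange(4)).map(lambda _: sys.executable).distinct().collect()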