Running pyspark.mllib on Ubuntu

Posted: 2016-07-12 08:09:21

Tags: python-2.7 ubuntu apache-spark pyspark apache-spark-mllib

I am trying to link Spark into Python. The code below is test.py, which I put under ~/spark/python:

from pyspark import SparkContext, SparkConf
from pyspark.mllib.fpm import FPGrowth

# Placeholder values; the original snippet does not show how these were set.
appName = "FPGrowthExample"
master = "local[*]"

conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)

# Each line of the input file is one space-separated transaction.
data = sc.textFile("data/mllib/sample_fpgrowth.txt")
transactions = data.map(lambda line: line.strip().split(' '))

# Mine frequent itemsets with FP-growth.
model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10)
result = model.freqItemsets().collect()
for fi in result:
    print(fi)

When I run python test.py, I get this error message:

Exception in thread "main" java.lang.IllegalStateException: Library directory '/home/user/spark/lib_managed/jars' does not exist.
        at org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:249)
        at org.apache.spark.launcher.AbstractCommandBuilder.buildClassPath(AbstractCommandBuilder.java:208)
        at org.apache.spark.launcher.AbstractCommandBuilder.buildJavaCommand(AbstractCommandBuilder.java:119)
        at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(SparkSubmitCommandBuilder.java:195)
        at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:121)
        at org.apache.spark.launcher.Main.main(Main.java:86)
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    conf = SparkConf().setAppName(appName).setMaster(master)
  File "/home/user/spark/python/pyspark/conf.py", line 104, in __init__
    SparkContext._ensure_initialized()
  File "/home/user/spark/python/pyspark/context.py", line 245, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/home/user/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway
    raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number

When I move test.py to ~/spark, I get:

Traceback (most recent call last):
  File "test.py", line 1, in <module>
    from pyspark import SparkContext, SparkConf
ImportError: No module named pyspark

I cloned the Spark project from the official website. OS: Ubuntu, Java version: 1.7.0_79, Python version: 2.7.11.

Can anyone give me some tips for solving this problem?

2 Answers:

Answer 0 (score: 0)

If you have not set SPARK_HOME and added its lib directories to PYTHONPATH, check this.
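
A minimal sketch of doing this from inside the script itself, assuming Spark lives at /home/user/spark (the py4j zip name varies with the Spark version, so check python/lib/ first):

import os
import sys

# Tell pyspark where Spark lives (adjust the path to your installation).
os.environ["SPARK_HOME"] = "/home/user/spark"

# Make the pyspark package and its bundled py4j importable from anywhere.
sys.path.insert(0, "/home/user/spark/python")
sys.path.insert(0, "/home/user/spark/python/lib/py4j-0.9-src.zip")  # zip name varies by Spark version

from pyspark import SparkContext, SparkConf  # should now resolve regardless of the working directory

The same effect is usually achieved by exporting SPARK_HOME and PYTHONPATH in ~/.bashrc, which avoids hard-coding paths in every script.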

Also, regarding "I cloned the Spark project from the official website":

This is not recommended, since it can create a lot of dependency problems. You can instead download a version pre-built for Hadoop and then test it in local mode using the instructions here.
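
Once a pre-built package is unpacked and SPARK_HOME/PYTHONPATH point at it, a minimal local-mode smoke test could look like this (a sketch, not taken from the original answer):

from pyspark import SparkContext

# "local[*]" runs Spark inside this process on all available cores,
# so no cluster or Hadoop setup is needed.
sc = SparkContext("local[*]", "SmokeTest")

# A tiny job: if this prints 4950, the installation works.
print(sc.parallelize(range(100)).sum())
sc.stop()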

Answer 1 (score: 0)

Spark programs have to be submitted via spark-submit. More information: Documentation

You should try running $SPARK_HOME/bin/spark-submit test.py instead of python test.py.
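
One common variation (a sketch, not part of the original answer): when submitting through spark-submit, the master can come from the command line instead of being hard-coded, e.g. $SPARK_HOME/bin/spark-submit --master "local[*]" test.py, so the script only needs an application name:

from pyspark import SparkConf, SparkContext

# spark-submit supplies the master via --master,
# so setMaster is deliberately omitted here.
conf = SparkConf().setAppName("FPGrowthExample")
sc = SparkContext(conf=conf)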