Create a PySpark kernel for Jupyter

Date: 2016-01-25 17:20:42

Tags: apache-spark ipython pyspark jupyter

I looked at using Apache Toree as a PySpark kernel for Jupyter:

https://github.com/apache/incubator-toree

However, it uses an older version of Spark (1.5.1, versus the current 1.6.0 release). Instead, I tried to create a kernel.json using the method described at http://arnesund.com/2015/09/21/spark-cluster-on-openstack-with-multi-user-jupyter-notebook/:
{
 "display_name": "PySpark",
 "language": "python",
 "argv": [
  "/usr/bin/python",
  "-m",
  "ipykernel",
  "-f",
  "{connection_file}"
 ],
 "env": {
  "SPARK_HOME": "/usr/local/Cellar/apache-spark/1.6.0/libexec",
  "PYTHONPATH": "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/:/usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/py4j-0.9-src.zip",
  "PYTHONSTARTUP": "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/pyspark/shell.py",
  "PYSPARK_SUBMIT_ARGS": "--master local[*] pyspark-shell"
 }
}

However, I ran into a few issues:

  1. There is no /jupyter/kernels path on my Mac, so I ended up creating ~/.jupyter/kernels/pyspark and putting kernel.json there. I am not sure whether that is the correct location (see the sketch after this list).

  2. Even with all the paths set correctly, I still don't see PySpark show up as a kernel in Jupyter.

  3. What am I missing?
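For reference, one way to register a kernel spec without guessing the directory by hand is to let jupyter_client install it for you. This is a minimal sketch, not part of the original question; it assumes the kernel.json above is saved in a local directory named pyspark_kernel/ (a hypothetical name):

from jupyter_client.kernelspec import KernelSpecManager

# Copy pyspark_kernel/ (containing kernel.json) into the per-user kernel
# directory under the name "pyspark". "pyspark_kernel" is a hypothetical
# local directory holding the spec shown above.
ksm = KernelSpecManager()
ksm.install_kernel_spec('pyspark_kernel', kernel_name='pyspark', user=True)

# List every kernel spec Jupyter can see, to confirm it was picked up.
print(ksm.find_kernel_specs())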

2 answers:

Answer 0 (score: 20)

Start jupyter notebook with the regular Python kernel, then run the following to initialize pyspark inside Jupyter:

import findspark
findspark.init()  # locate Spark via SPARK_HOME and add pyspark to sys.path

import pyspark
sc = pyspark.SparkContext()  # create a SparkContext with the default config
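As a quick sanity check (not part of the original answer), a trivial job confirms the context works:

# Hypothetical smoke test: sums 0..99 on the local Spark context.
print(sc.parallelize(range(100)).sum())  # expected output: 4950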

FYI: I tried most of the configurations to launch Apache Toree with a pyspark kernel in Jupyter, without success.

Answer 1 (score: 1)

Jupyter kernels should go into $JUPYTER_DATA_DIR. On OS X, this is ~/Library/Jupyter. See: http://jupyter.readthedocs.org/en/latest/system.html
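If you are unsure what $JUPYTER_DATA_DIR resolves to on your machine, jupyter_core can report it. A minimal sketch, assuming jupyter_core is installed (it ships with Jupyter):

from jupyter_core.paths import jupyter_data_dir, jupyter_path

# Per-user data directory (~/Library/Jupyter on OS X).
print(jupyter_data_dir())

# Every directory Jupyter searches for kernel specs, in priority order.
print(jupyter_path('kernels'))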