I am using a Dockerized image with Jupyter Notebook and a SparkR kernel. When I create a SparkR notebook, it uses the Microsoft R (3.3.2) installation instead of the vanilla CRAN R install (3.2.3).
The Docker image I'm using installs some custom R libraries and Python packages, but I did not explicitly install Microsoft R. Regardless of whether I can remove Microsoft R or keep both installations side by side, how do I get my SparkR kernel to use the custom installation of R?
Thanks in advance
Answer 0 (score: 2)
Docker-related issues aside, the settings of a Jupyter kernel are configured in a file named kernel.json, which resides in a specific directory (one per kernel); these can be listed with the command jupyter kernelspec list. For example, here is the case on my (Linux) machine:
$ jupyter kernelspec list
Available kernels:
python2 /usr/lib/python2.7/site-packages/ipykernel/resources
caffe /usr/local/share/jupyter/kernels/caffe
ir /usr/local/share/jupyter/kernels/ir
pyspark /usr/local/share/jupyter/kernels/pyspark
pyspark2 /usr/local/share/jupyter/kernels/pyspark2
tensorflow /usr/local/share/jupyter/kernels/tensorflow
Again as an example, here are the contents of the kernel.json for my R kernel (ir):
{
"argv": ["/usr/lib64/R/bin/R", "--slave", "-e", "IRkernel::main()", "--args", "{connection_file}"],
"display_name": "R 3.3.2",
"language": "R"
}
And here is the corresponding file for my pyspark2 kernel:
{
"display_name": "PySpark (Spark 2.0)",
"language": "python",
"argv": [
"/opt/intel/intelpython27/bin/python2",
"-m",
"ipykernel",
"-f",
"{connection_file}"
],
"env": {
"SPARK_HOME": "/home/ctsats/spark-2.0.0-bin-hadoop2.6",
"PYTHONPATH": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python:/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/lib/py4j-0.10.1-src.zip",
"PYTHONSTARTUP": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/pyspark/shell.py",
"PYSPARK_PYTHON": "/opt/intel/intelpython27/bin/python2"
}
}
As you can see, in both cases the first element of argv is the executable for the respective language: GNU R for my ir kernel, and Intel Python 2.7 for my pyspark2 kernel. Changing this so that it points to your GNU R executable should resolve your issue.
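In your case, that means editing the SparkR kernel's kernel.json so its first argv element is the CRAN R binary. A minimal sketch is shown below; the paths (/usr/lib/R/bin/R for the CRAN R executable and /opt/spark for SPARK_HOME) are assumptions for illustration only, so substitute whatever your Docker image actually contains:
{
 "display_name": "SparkR (CRAN R 3.2.3)",
 "language": "R",
 "argv": ["/usr/lib/R/bin/R", "--slave", "-e", "IRkernel::main()", "--args", "{connection_file}"],
 "env": {
  "SPARK_HOME": "/opt/spark"
 }
}
The kernel.json is read each time the kernel is launched, so after editing it you only need to start a new notebook with that kernel for the change to take effect.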
Answer 1 (score: 0)
To use a custom R environment, I believe you need to set the following application properties when starting Spark:
"spark.r.command": "/custom/path/bin/R",
"spark.r.driver.command": "/custom/path/bin/Rscript",
"spark.r.shell.command" : "/custom/path/bin/R"
More complete documentation is available here: https://spark.apache.org/docs/latest/configuration.html#sparkr
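For example, one common way to set these properties is in $SPARK_HOME/conf/spark-defaults.conf, so that every SparkR session picks them up. This is only a sketch, assuming your custom R lives under /custom/path as in the snippet above; adjust the paths to your image:
# assumed custom R location; adjust to your installation
spark.r.command           /custom/path/bin/R
spark.r.driver.command    /custom/path/bin/Rscript
spark.r.shell.command     /custom/path/bin/R
Alternatively, the same properties can be passed on the command line with --conf when launching Spark.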