Modify a Jupyter kernel to add a Cassandra connection in Spark

Date: 2018-06-01 09:22:30

Tags: apache-spark cassandra pyspark jupyter-notebook

I have a Jupyter kernel that uses PySpark.

> cat kernel.json
{"argv":["python","-m","sparkmagic.kernels.pysparkkernel.pysparkkernel", "-f", "{connection_file}"],
 "display_name":"PySpark"
}

I want to modify this kernel to add a connection to Cassandra. In script mode, I run:

pyspark \
    --packages anguenot:pyspark-cassandra:0.7.0 \
    --conf spark.cassandra.connection.host=12.34.56.78 \
    --conf spark.cassandra.auth.username=cassandra \
    --conf spark.cassandra.auth.password=cassandra

The script version works perfectly, but I want to do the same thing in Jupyter.

Where in the kernel should I put this information? I have already tried these two:

{"argv":["python","-m","sparkmagic.kernels.pysparkkernel.pysparkkernel", "-f", "{connection_file}"],
 "display_name":"PySpark with Cassandra",
 "spark.jars.packages": "anguenot:pyspark-cassandra:0.7.0",
 "spark.cassandra.connection.host": "12.34.56.78",
 "spark.cassandra.auth.username": "cassandra",
 "spark.cassandra.auth.password": "cassandra"
}

{"argv":["python","-m","sparkmagic.kernels.pysparkkernel.pysparkkernel", "-f", "{connection_file}"],
 "display_name":"PySpark with Cassandra",
 "PYSPARK_SUBMIT_ARGS": "--packages anguenot:pyspark-cassandra:0.7.0 --conf spark.cassandra.connection.host=12.34.56.78 --conf spark.cassandra.auth.username=cassandra --conf spark.cassandra.auth.password=cassandra"
}

Neither of them works. When I execute:

sqlContext.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="my_table", keyspace="my_keyspace")\
    .load()

I get the error java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra

FYI: I do not create a Spark session in the notebook. The sc object already exists when the kernel starts.

1 Answer:

Answer 0: (score: 0)

The spark.jars.* options have to be configured before the SparkContext is initialized; once it has been created, setting them has no effect. This means you have to do one of the following:

  • Modify SPARK_HOME/conf/spark-defaults.conf or SPARK_CONF_DIR/spark-defaults.conf, making sure that SPARK_HOME or SPARK_CONF_DIR is in scope when the kernel starts.
  • Modify the kernel initialization code (the place where the SparkContext is initialized) using the same method described in Add Jar to standalone pyspark.
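For the first option, a minimal spark-defaults.conf fragment using the package and credentials from the question might look like the following (the host and passwords here are the placeholder values from the question, not real ones):

```
# spark-defaults.conf — applied to every context created under this SPARK_HOME
spark.jars.packages              anguenot:pyspark-cassandra:0.7.0
spark.cassandra.connection.host  12.34.56.78
spark.cassandra.auth.username    cassandra
spark.cassandra.auth.password    cassandra
```

Because these values are read when the SparkContext is constructed, they take effect even though the kernel creates sc before any notebook code runs.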

I also strongly recommend Configuring Spark to work with Jupyter Notebook and Anaconda.
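As an aside, Jupyter kernelspecs support an "env" field for setting environment variables before the kernel process starts. A sketch of the kernel.json from the question using that field (untested; whether the sparkmagic kernel honors PYSPARK_SUBMIT_ARGS depends on its version, and a plain PySpark launch expects the trailing "pyspark-shell" token):

```json
{"argv": ["python", "-m", "sparkmagic.kernels.pysparkkernel.pysparkkernel",
          "-f", "{connection_file}"],
 "display_name": "PySpark with Cassandra",
 "env": {
   "PYSPARK_SUBMIT_ARGS": "--packages anguenot:pyspark-cassandra:0.7.0 --conf spark.cassandra.connection.host=12.34.56.78 --conf spark.cassandra.auth.username=cassandra --conf spark.cassandra.auth.password=cassandra pyspark-shell"
 }
}
```

The difference from the question's second attempt is that the variable is nested under "env" (where Jupyter exports it into the kernel's environment) rather than placed as a top-level key, which Jupyter ignores.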