I want to connect PySpark and Google Colab. My data lives in MongoDB in the cloud (mLab).
In Google Colab, I run the following setup script:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://www-eu.apache.org/dist/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz
!tar xf spark-2.3.2-bin-hadoop2.7.tgz
!pip install -q findspark
!pip install pyspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.3.2-bin-hadoop2.7"
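findspark is installed above but never initialized. A minimal sketch of completing the setup, assuming the `PYSPARK_SUBMIT_ARGS` environment variable is used to replicate the `--packages` flag of the local spark-submit command for an in-process session:

```python
import os

# Same paths as the environment variables set above.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.3.2-bin-hadoop2.7"

# Replicate spark-submit's --packages flag: PySpark reads
# PYSPARK_SUBMIT_ARGS before launching the JVM, so the Mongo
# connector is fetched when the session is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.5 "
    "pyspark-shell"
)

try:
    import findspark
    findspark.init()  # puts SPARK_HOME/python on sys.path
except ImportError:
    pass  # findspark is only needed when pyspark was not pip-installed
```

After this, `SparkSession.builder...getOrCreate()` can be called in the notebook itself instead of going through spark-submit.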
Then, in my local environment, I run the Python script with this line:
spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.5 wordcount.py
However, on Colab I cannot run spark-submit directly...
Indeed, the script starts with:
from pyspark.sql import SparkSession

uri_in = "mongodb://{}:{}@{}.speeches".format(mongo_user, mongo_password, mongo_url)
uri_out = "mongodb://{}:{}@{}.wordcount_out".format(
    mongo_user, mongo_password, mongo_url
)
spark = (
    SparkSession.builder.appName("discursos.counter")
    .config("spark.mongodb.input.uri", uri_in)
    .config("spark.mongodb.output.uri", uri_out)
    .getOrCreate()
)
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
Can I connect to MongoDB from Google Colab via PySpark?
Thanks!