I am trying to set up a Jupyter notebook integrated with Spark. I run the Spark master on my local machine and, for practice, the worker on the same machine as well. However, when I run the application through Jupyter, it never gets past df.show().
Dockerfile:
# Copyright (c) Jupyter Development Team.
# Distributed under the terms of the Modified BSD License.
ARG BASE_CONTAINER=jupyter/scipy-notebook
FROM $BASE_CONTAINER
LABEL maintainer="Jupyter Project <jupyter@googlegroups.com>"
USER root
# Spark dependencies
ENV SPARK_VERSION 2.3.2
ENV SPARK_HADOOP_PROFILE 2.7
ENV SPARK_SRC_URL https://www.apache.org/dist/spark/spark-$SPARK_VERSION/spark-${SPARK_VERSION}-bin-hadoop${SPARK_HADOOP_PROFILE}.tgz
ENV SPARK_HOME=/opt/spark
ENV PATH $PATH:$SPARK_HOME/bin
RUN apt-get update && \
    apt-get install -y openjdk-8-jdk-headless postgresql && \
    rm -rf /var/lib/apt/lists/*
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/
ENV PATH $PATH:$JAVA_HOME/bin
RUN wget ${SPARK_SRC_URL}
RUN tar -xzf spark-${SPARK_VERSION}-bin-hadoop${SPARK_HADOOP_PROFILE}.tgz
RUN mv spark-${SPARK_VERSION}-bin-hadoop${SPARK_HADOOP_PROFILE} /opt/spark
RUN rm -f spark-${SPARK_VERSION}-bin-hadoop${SPARK_HADOOP_PROFILE}.tgz
ENV SPARK_MASTER local[*]
ENV SPARK_DRIVER_PORT 5001
ENV SPARK_UI_PORT 5002
ENV SPARK_BLOCKMGR_PORT 5003
EXPOSE $SPARK_DRIVER_PORT $SPARK_UI_PORT $SPARK_BLOCKMGR_PORT
USER $NB_UID
ENV POST_URL https://jdbc.postgresql.org/download/postgresql-42.2.5.jar
RUN wget ${POST_URL}
RUN mv postgresql-42.2.5.jar $SPARK_HOME/jars
# Install pyarrow
RUN conda install --quiet -y 'pyarrow' && \
    conda clean -tipsy && \
    fix-permissions $CONDA_DIR && \
    fix-permissions /home/$NB_USER
WORKDIR $SPARK_HOME
Built with the following command: docker build -t my_notebook .
docker-compose.yml(master):
master:
  image: my_notebook
  command: bin/spark-class org.apache.spark.deploy.master.Master -h master
  hostname: master
  environment:
    MASTER: spark://master:7077
    SPARK_CONF_DIR: /conf
    SPARK_PUBLIC_DNS: localhost
  expose:
    - 7001
    - 7002
    - 7003
    - 7004
    - 7005
    - 7077
    - 6066
  ports:
    - 4040:4040
    - 6066:6066
    - 7077:7077
    - 8080:8080
  volumes:
    - ./conf/master:/conf
    - ./data:/tmp/data
docker-compose.yml (worker):
worker:
  image: my_notebook
  command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://192.168.1.129:7077
  hostname: worker
  environment:
    SPARK_CONF_DIR: /conf
    SPARK_WORKER_CORES: 4
    SPARK_WORKER_MEMORY: 4g
    SPARK_WORKER_PORT: 8881
    SPARK_WORKER_WEBUI_PORT: 8081
    SPARK_PUBLIC_DNS: localhost
  expose:
    - 7012
    - 7013
    - 7014
    - 7015
    - 8881
  ports:
    - 8081:8081
  volumes:
    - ./conf/worker:/conf
    - ./data:/tmp/data
Jupyter code:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext
from pyspark.sql import DataFrameReader
conf = SparkConf().setAppName('Kiwi Data Application')
conf.set('spark.executor.memory', '1G')
conf.set('spark.executor.cores', '2')
sc = SparkContext(master="spark://localhost:7077", conf=conf)
SparkSession.builder.config(conf=SparkConf()).getOrCreate()
sqlContext = SQLContext(sc)
print('sql context')
# Define JDBC properties for DB Connection
url = "postgresql://IP:PORT/gpdb_qa"
properties = {
    "user": "user",
    "password": "pass",
    "fetchsize": "100000"
}
df = DataFrameReader(sqlContext).jdbc(
    url='jdbc:%s' % url,
    table=query,
    properties=properties
)
print('read')
df.show()
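For completeness, my own guess while debugging (untested): the worker stderr below shows the executor trying to reach the driver at notebook:35147, a hostname and random port, even though my Dockerfile reserves SPARK_DRIVER_PORT=5001 and SPARK_BLOCKMGR_PORT=5003 while the notebook code never sets them. A sketch of the driver-side settings I believe would pin those endpoints to something the worker container can reach (192.168.1.129 is my host's IP, taken from the worker compose file):

```
# Untested sketch: pin the driver endpoints so executors can connect back.
# 192.168.1.129 is assumed to be the host's IP (as used in the worker compose file);
# the ports match SPARK_DRIVER_PORT / SPARK_BLOCKMGR_PORT from the Dockerfile.
spark.driver.host          192.168.1.129
spark.driver.port          5001
spark.blockManager.port    5003
spark.driver.bindAddress   0.0.0.0
```

These could equally be passed via conf.set(...) before creating the SparkContext; I have not confirmed this fixes the issue.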
Master logs:
master_1 | 2019-01-02 06:48:11 INFO Utils:54 - Successfully started service 'sparkMaster' on port 7077.
master_1 | 2019-01-02 06:48:11 INFO Master:54 - Starting Spark master at spark://master:7077
master_1 | 2019-01-02 06:48:11 INFO Master:54 - Running Spark version 2.3.2
master_1 | 2019-01-02 06:48:11 INFO log:192 - Logging initialized @5563ms
master_1 | 2019-01-02 06:48:11 INFO Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
master_1 | 2019-01-02 06:48:11 INFO Server:419 - Started @5640ms
master_1 | 2019-01-02 06:48:11 INFO AbstractConnector:278 - Started ServerConnector@43cb4127{HTTP/1.1,[http/1.1]}{0.0.0.0:8080}
master_1 | 2019-01-02 06:48:11 INFO Utils:54 - Successfully started service 'MasterUI' on port 8080.
master_1 | 2019-01-02 06:48:11 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2bd387be{/app,null,AVAILABLE,@Spark}
master_1 | 2019-01-02 06:48:11 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6256c056{/app/json,null,AVAILABLE,@Spark}
master_1 | 2019-01-02 06:48:11 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7b2c2e74{/,null,AVAILABLE,@Spark}
master_1 | 2019-01-02 06:48:11 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6ca8c5ad{/json,null,AVAILABLE,@Spark}
master_1 | 2019-01-02 06:48:11 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3828fc1e{/static,null,AVAILABLE,@Spark}
master_1 | 2019-01-02 06:48:11 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@780ebb19{/app/kill,null,AVAILABLE,@Spark}
master_1 | 2019-01-02 06:48:11 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1a3c71cf{/driver/kill,null,AVAILABLE,@Spark}
master_1 | 2019-01-02 06:48:11 INFO MasterWebUI:54 - Bound MasterWebUI to 0.0.0.0, and started at http://localhost:8080
master_1 | 2019-01-02 06:48:11 INFO Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
master_1 | 2019-01-02 06:48:11 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@10529071{/,null,AVAILABLE}
master_1 | 2019-01-02 06:48:11 INFO AbstractConnector:278 - Started ServerConnector@2699a66b{HTTP/1.1,[http/1.1]}{master:6066}
master_1 | 2019-01-02 06:48:11 INFO Server:419 - Started @5835ms
master_1 | 2019-01-02 06:48:11 INFO Utils:54 - Successfully started service on port 6066.
master_1 | 2019-01-02 06:48:11 INFO StandaloneRestServer:54 - Started REST server for submitting applications on port 6066
master_1 | 2019-01-02 06:48:12 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@201a4303{/metrics/master/json,null,AVAILABLE,@Spark}
master_1 | 2019-01-02 06:48:12 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3e4a39d0{/metrics/applications/json,null,AVAILABLE,@Spark}
master_1 | 2019-01-02 06:48:12 INFO Master:54 - I have been elected leader! New state: ALIVE
master_1 | 2019-01-02 06:48:32 INFO Master:54 - Registering worker 172.17.0.4:8881 with 2 cores, 12.0 GB RAM
master_1 | 2019-01-02 06:49:29 INFO Master:54 - Registering app Kiwi Data Application
master_1 | 2019-01-02 06:49:29 INFO Master:54 - Registered app Kiwi Data Application with ID app-20190102064929-0000
master_1 | 2019-01-02 06:49:29 INFO Master:54 - Launching executor app-20190102064929-0000/0 on worker worker-20190102064831-172.17.0.4-8881
master_1 | 2019-01-02 06:49:32 INFO Master:54 - Removing executor app-20190102064929-0000/0 because it is EXITED
master_1 | 2019-01-02 06:49:32 INFO Master:54 - Launching executor app-20190102064929-0000/1 on worker worker-20190102064831-172.17.0.4-8881
master_1 | 2019-01-02 06:49:34 INFO Master:54 - Removing executor app-20190102064929-0000/1 because it is EXITED
Worker logs:
worker_1 | 2019-01-02 06:48:32 INFO Worker:54 - Successfully registered with master spark://master:7077
worker_1 | 2019-01-02 06:49:29 INFO Worker:54 - Asked to launch executor app-20190102064929-0000/0 for Kiwi Data Application
worker_1 | 2019-01-02 06:49:29 INFO SecurityManager:54 - Changing view acls to: jovyan
worker_1 | 2019-01-02 06:49:29 INFO SecurityManager:54 - Changing modify acls to: jovyan
worker_1 | 2019-01-02 06:49:29 INFO SecurityManager:54 - Changing view acls groups to:
worker_1 | 2019-01-02 06:49:29 INFO SecurityManager:54 - Changing modify acls groups to:
worker_1 | 2019-01-02 06:49:29 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jovyan); groups with view permissions: Set(); users with modify permissions: Set(jovyan); groups with modify permissions: Set()
worker_1 | 2019-01-02 06:49:29 INFO ExecutorRunner:54 - Launch command: "/usr/lib/jvm/java-8-openjdk-amd64//bin/java" "-cp" "/conf/:/opt/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=41017" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@09a92e44f4de:41017" "--executor-id" "0" "--hostname" "172.17.0.4" "--cores" "2" "--app-id" "app-20190102064929-0000" "--worker-url" "spark://Worker@172.17.0.4:8881"
worker_1 | 2019-01-02 06:49:32 INFO Worker:54 - Executor app-20190102064929-0000/0 finished with state EXITED message Command exited with code 1 exitStatus 1
worker_1 | 2019-01-02 06:49:32 INFO Worker:54 - Asked to launch executor app-20190102064929-0000/1 for Kiwi Data Application
worker_1 | 2019-01-02 06:49:32 INFO SecurityManager:54 - Changing view acls to: jovyan
worker_1 | 2019-01-02 06:49:32 INFO SecurityManager:54 - Changing modify acls to: jovyan
worker_1 | 2019-01-02 06:49:32 INFO SecurityManager:54 - Changing view acls groups to:
worker_1 | 2019-01-02 06:49:32 INFO SecurityManager:54 - Changing modify acls groups to:
worker_1 | 2019-01-02 06:49:32 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jovyan); groups with view permissions: Set(); users with modify permissions: Set(jovyan); groups with modify permissions: Set()
worker_1 | 2019-01-02 06:49:32 INFO ExecutorRunner:54 - Launch command: "/usr/lib/jvm/java-8-openjdk-amd64//bin/java" "-cp" "/conf/:/opt/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=41017" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@09a92e44f4de:41017" "--executor-id" "1" "--hostname" "172.17.0.4" "--cores" "2" "--app-id" "app-20190102064929-0000" "--worker-url" "spark://Worker@172.17.0.4:8881"
worker_1 | 2019-01-02 06:49:34 INFO Worker:54 - Executor app-20190102064929-0000/1 finished with state EXITED message Command exited with code 1 exitStatus 1
worker_1 | 2019-01-02 06:49:34 INFO Worker:54 - Asked to launch executor app-20190102064929-0000/2 for Kiwi Data Application
worker_1 | 2019-01-02 06:49:34 INFO SecurityManager:54 - Changing view acls to: jovyan
worker_1 | 2019-01-02 06:49:34 INFO SecurityManager:54 - Changing modify acls to: jovyan
worker_1 | 2019-01-02 06:49:34 INFO SecurityManager:54 - Changing view acls groups to:
worker_1 | 2019-01-02 06:49:34 INFO SecurityManager:54 - Changing modify acls groups to:
worker_1 | 2019-01-02 06:49:34 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jovyan); groups with view permissions: Set(); users with modify permissions: Set(jovyan); groups with modify permissions: Set()
Jupyter notebook (application) logs:
[Stage 0:> (0 + 0) / 1]2019-01-02 05:22:53 WARN TaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
notebook_1 | 2019-01-02 05:23:08 WARN TaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
notebook_1 | 2019-01-02 05:23:23 WARN TaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
notebook_1 | 2019-01-02 05:23:38 WARN TaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
[Stage 0:> (0 + 0) / 1]2019-01-02 05:23:53 WARN TaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
notebook_1 | 2019-01-02 05:24:08 WARN TaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
notebook_1 | 2019-01-02 05:24:23 WARN TaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Spark worker stderr log:
Spark Executor Command: "/usr/lib/jvm/java-8-openjdk-amd64//bin/java" "-cp" "/conf/:/opt/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=35147" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@notebook:35147" "--executor-id" "31" "--hostname" "172.17.0.3" "--cores" "2" "--app-id" "app-20190101134023-0001" "--worker-url" "spark://Worker@172.17.0.3:8881"
========================================
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:63)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:293)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:63)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
... 4 more
Caused by: java.io.IOException: Failed to connect to notebook:35147
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: notebook
at java.net.InetAddress.getAllByName0(InetAddress.java:1281)
at java.net.InetAddress.getAllByName(InetAddress.java:1193)
at java.net.InetAddress.getAllByName(InetAddress.java:1127)
at java.net.InetAddress.getByName(InetAddress.java:1077)
at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
at java.security.AccessController.doPrivileged(Native Method)
at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63)
at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55)
at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:208)
at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:49)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:188)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:174)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:978)
at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:512)
at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:423)
at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:482)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
... 1 more
Please point me in the right direction if I am doing something wrong.