Spark Streaming with Neo4j hangs when running on Docker

Asked: 2018-06-27 13:33:58

Tags: apache-spark docker neo4j docker-compose

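I have built a Docker image of my application. When I run the application on its own from a bash script, it works fine. However, when I run it as part of a docker-compose file, the application hangs at this message: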

18/06/27 13:17:18 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint

Even if I wait for a while, the heartbeat still times out. What could cause a Spark Streaming + Neo4j application to behave like this under Docker, and how can I improve it?

The docker-compose file for my application:

version: '3.3'
services:
  consumer-demo:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        - ARG_CLASS=consumer
        - HOST=neo4jdb
    volumes:
      - ./:/workdir
    working_dir: /workdir
    restart: always

The overall docker-compose file for all the applications:

version: '3.3'
services:
  kafka:
    image: spotify/kafka
    ports:
      - "9092:9092"
    networks:
      - docker_elk
    environment:
      - ADVERTISED_HOST=localhost
  neo4jdb:
    image: neo4j:latest
    container_name: neo4jdb
    ports:
      - "7474:7474"
      - "7473:7473"
      - "7687:7687"
    networks:
      - docker_elk
    volumes:
      - /var/lib/neo4j/import:/var/lib/neo4j/import
      - /var/lib/neo4j/data:/data
      - /var/lib/neo4j/conf:/conf
    environment:
      - NEO4J_dbms_active__database=graphImport.db
  elasticsearch:
    image: elasticsearch:latest
    ports:
      - "9200:9200"
      - "9300:9300"
    networks:
      - docker_elk
    volumes:
      - esdata1:/usr/share/elasticsearch/data
  kibana:
    image: kibana:latest
    ports:
      - "5601:5601"
    networks:
      - docker_elk
volumes:
  esdata1:
    driver: local

networks:
  docker_elk:
    driver: bridge

The bash script with which the application works fine:

#!/usr/bin/env bash
if [ "$1" = "consumer" ]
then
    java -cp "jars/spark_consumer.jar" consumer.SparkConsumer 
else
    echo "Wrong parameter. It should be consumer or producer, but it is $1"
fi

The application's Dockerfile, which may be the reason the application runs slowly:

FROM java:8
ARG ARG_CLASS
ARG HOST
ENV MAIN_CLASS $ARG_CLASS
ENV SCALA_VERSION 2.11.8
ENV SBT_VERSION 1.1.1
ENV SPARK_VERSION 2.2.0
ENV SPARK_DIST spark-$SPARK_VERSION-bin-hadoop2.6
ENV SPARK_ARCH $SPARK_DIST.tgz
ENV HOSTNAME bolt://$HOST:7687
VOLUME /workdir

WORKDIR /opt

# Install Scala
RUN \
  cd /root && \
  curl -o scala-$SCALA_VERSION.tgz http://downloads.typesafe.com/scala/$SCALA_VERSION/scala-$SCALA_VERSION.tgz && \
  tar -xf scala-$SCALA_VERSION.tgz && \
  rm scala-$SCALA_VERSION.tgz && \
  echo >> /root/.bashrc && \
  echo 'export PATH=~/scala-$SCALA_VERSION/bin:$PATH' >> /root/.bashrc

# Install SBT
RUN \
  curl -L -o sbt-$SBT_VERSION.deb https://dl.bintray.com/sbt/debian/sbt-$SBT_VERSION.deb && \
  dpkg -i sbt-$SBT_VERSION.deb && \
  rm sbt-$SBT_VERSION.deb


# Install Spark
RUN \
    cd /opt && \
    curl -o $SPARK_ARCH http://d3kbcqa49mib13.cloudfront.net/$SPARK_ARCH && \
    tar xvfz $SPARK_ARCH && \
    rm $SPARK_ARCH && \
    echo 'export PATH=/opt/$SPARK_DIST/bin:$PATH' >> /root/.bashrc

EXPOSE 9851 9852 4040 9092 9200 9300 5601 7474 7687 7473

CMD /workdir/runDemo.sh "$MAIN_CLASS" 

1 Answer:

Answer 0 (score: 0)

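The problem was that another Spark process was running on the machine and blocking the Spark streaming job. I checked all processes with ps aux | grep spark and found another one running. Killing that process and restarting the Spark Streaming application solved the problem.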
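
The check described in this answer can be sketched as a small shell helper. This is only an illustration, assuming a Linux machine with the standard ps, grep and awk tools; the function names filter_spark and spark_pids are hypothetical and not part of Spark or Docker.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (names are assumptions, not from the original post).

# Keep only `ps aux`-style lines from stdin that mention "spark",
# dropping the grep command itself so it is not reported as a match.
filter_spark() {
    grep -i 'spark' | grep -v 'grep' || true
}

# Print the PID (second column of `ps aux` output) of each matching process.
spark_pids() {
    ps aux | filter_spark | awk '{print $2}'
}
```

Calling spark_pids prints the candidate PIDs. It is safer to inspect each one (for example with ps -fp followed by the PID) before stopping it with kill, since the match is a plain text search and may include unrelated processes whose command line happens to contain "spark".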