No output on the console when using the show() method; my DataFrame appears to be lost

Time: 2019-07-12 04:37:15

Tags: apache-spark pyspark apache-kafka spark-streaming

I am a beginner with PySpark and I am building a pilot project in Spark. I develop the project in the PyCharm IDE, and it runs fine there. I use `productInfo = sqlContext.read.json(rdd)` to convert the RDD value (which contains my JSON) into a DataFrame; after the conversion the job works correctly on my local machine, and calling `.show()` prints the DataFrame as expected.
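For reference, this is a minimal, self-contained sketch of that conversion step in a plain local (non-streaming) session; the sample JSON records and field names here are invented for illustration only:

    # coding: utf-8
    # Sketch: turn an RDD of JSON strings into a DataFrame and print it with show().
    # The sample records below are made up; only the read.json(rdd) + show() pattern
    # is the point being illustrated.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local[2]", "read-json-rdd-sketch")
    sqlContext = SQLContext(sc)

    # An RDD of JSON strings, e.g. the values pulled out of Kafka messages
    rdd = sc.parallelize([
        '{"product_id": 1, "name": "pen",    "price": 10.5}',
        '{"product_id": 2, "name": "pencil", "price": 3.0}',
    ])

    # read.json accepts an RDD of JSON strings and infers the schema from them
    productInfo = sqlContext.read.json(rdd)
    productInfo.show()   # prints the two rows as a formatted table

    sc.stop()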

The problem appears when I set all of this up (Kafka, Apache Spark) on EC2 (Ubuntu 18.04.2 LTS) and run the job with spark-submit: when execution reaches the show() method, the console output stops and nothing is displayed, and the same thing happens on every batch where show() starts and stops. I cannot figure out what is wrong; no error is shown in the console, and I have also checked that my data is arriving in the RDD.

  1. I also printed my RDD to make sure the data coming from Kafka is actually arriving, and it does print.
  2. The same code runs fine on my local machine in the PyCharm IDE, but when I run it on EC2 (Ubuntu 18.04) my DataFrame is lost (a debugging sketch of process() follows the code below).
    # coding: utf-8
    from pyspark import SparkContext
    from pyspark import SparkConf
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils
    from pyspark.sql import Row, DataFrame, SQLContext
    import pandas as pd

    def getSqlContextInstance(sparkContext):
        # Lazily create a single SQLContext and reuse it across batches
        if ('sqlContextSingletonInstance' not in globals()):
            globals()['sqlContextSingletonInstance'] = SQLContext(sparkContext)
        return globals()['sqlContextSingletonInstance']

    def process(time, rdd):
        print("========= %s =========" % str(time))

        try:
            #print("--------------Also cross check my data is present in rdd I checked by printing ----------------")
            #results = rdd.collect()
            #for result in results:
            #    print(result)

            # Get the singleton instance of SQLContext
            sqlContext = getSqlContextInstance(rdd.context)
            productInfo = sqlContext.read.json(rdd)

            # problem comes here when i try to show it
            productInfo.show()
        except:
            # NOTE: this bare except silently discards any exception raised by
            # read.json() or show(), so a failure here produces no output at all
            pass

    if __name__ == '__main__':
        conf = SparkConf().set("spark.cassandra.connection.host", "127.0.0.1")
        sc = SparkContext(conf = conf)
        sc.setLogLevel("WARN")
        sqlContext = SQLContext(sc)
        ssc = StreamingContext(sc, 10)
        kafkaStream = KafkaUtils.createStream(ssc, 'localhost:2181', 'spark-streaming', {'new_topic': 1})
        lines = kafkaStream.map(lambda x: x[1])
        lines.foreachRDD(process)
        #lines.pprint()
        ssc.start()
        ssc.awaitTermination()
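Because the except: pass in the code above silently discards any exception, one way to see what is actually failing on EC2 is a variant of process() that logs the error and skips empty batches. This is only a hedged debugging sketch, not a confirmed fix: it reuses the getSqlContextInstance helper from the question, and the isEmpty() check, printSchema() call, and traceback logging are diagnostic additions that are not in the original program.

    import traceback

    def process(time, rdd):
        # Variant of process() that surfaces errors instead of hiding them.
        print("========= %s =========" % str(time))
        try:
            if rdd.isEmpty():
                # Nothing arrived from Kafka in this batch; show() on an empty,
                # schema-less DataFrame would only print an empty table frame.
                print("empty batch, skipping")
                return
            sqlContext = getSqlContextInstance(rdd.context)
            productInfo = sqlContext.read.json(rdd)
            productInfo.printSchema()          # confirm the JSON was parsed into columns
            productInfo.show(truncate=False)
        except Exception:
            # Print the full stack trace instead of silently passing,
            # so the real failure shows up in the spark-submit console.
            traceback.print_exc()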

My console:

./spark-submit ReadingJsonFromKafkaAndWritingToScylla_CSV_Example.py
 19/07/10 11:13:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
 Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
 19/07/10 11:13:15 INFO SparkContext: Running Spark version 2.4.3
 19/07/10 11:13:15 INFO SparkContext: Submitted application: ReadingJsonFromKafkaAndWritingToScylla_CSV_Example.py
 19/07/10 11:13:15 INFO SecurityManager: Changing view acls to: kafka
 19/07/10 11:13:15 INFO SecurityManager: Changing modify acls to: kafka
 19/07/10 11:13:15 INFO SecurityManager: Changing view acls groups to: 
 19/07/10 11:13:15 INFO SecurityManager: Changing modify acls groups to: 
 19/07/10 11:13:15 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(kafka); groups with view permissions: Set(); users with modify permissions: Set(kafka); groups with modify permissions: Set()
 19/07/10 11:13:16 INFO Utils: Successfully started service 'sparkDriver' on port 41655.
 19/07/10 11:13:16 INFO SparkEnv: Registering MapOutputTracker
 19/07/10 11:13:16 INFO SparkEnv: Registering BlockManagerMaster
 19/07/10 11:13:16 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
 19/07/10 11:13:16 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
 19/07/10 11:13:16 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-33f848fe-88d7-4c8f-8440-8384e094c59c
 19/07/10 11:13:16 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
 19/07/10 11:13:16 INFO SparkEnv: Registering OutputCommitCoordinator
 19/07/10 11:13:16 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
 19/07/10 11:13:16 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
 19/07/10 11:13:16 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
 19/07/10 11:13:16 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
 19/07/10 11:13:16 WARN Utils: Service 'SparkUI' could not bind on port 4044. Attempting port 4045.
 19/07/10 11:13:16 WARN Utils: Service 'SparkUI' could not bind on port 4045. Attempting port 4046.
 19/07/10 11:13:16 INFO Utils: Successfully started service 'SparkUI' on port 4046.
 19/07/10 11:13:16 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://ip-172-31-92-134.ec2.internal:4046
 19/07/10 11:13:16 INFO Executor: Starting executor ID driver on host localhost
 19/07/10 11:13:16 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 34719.
 19/07/10 11:13:16 INFO NettyBlockTransferService: Server created on ip-172-31-92-134.ec2.internal:34719
 19/07/10 11:13:16 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
 19/07/10 11:13:16 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, ip-172-31-92-134.ec2.internal, 34719, None)
 19/07/10 11:13:16 INFO BlockManagerMasterEndpoint: Registering block manager ip-172-31-92-134.ec2.internal:34719 with 366.3 MB RAM, BlockManagerId(driver, ip-172-31-92-134.ec2.internal, 34719, None)
 19/07/10 11:13:16 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, ip-172-31-92-134.ec2.internal, 34719, None)
 19/07/10 11:13:16 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, ip-172-31-92-134.ec2.internal, 34719, None)
 19/07/10 11:13:17 WARN AppInfo$: Can't read Kafka version from MANIFEST.MF. Possible cause: java.lang.NullPointerException
 19/07/10 11:13:18 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
 19/07/10 11:13:18 WARN BlockManager: Block input-0-1562757198000 replicated to only 0 peer(s) instead of 1 peers

**This is when I am not producing any data to the Kafka topic:**

========= 2019-07-10 11:13:20 =========
 ---------------------in function procces----------------------
 -----------------------before printing----------------------
 ========= 2019-07-10 11:13:30 =========
 ---------------------in function procces----------------------
 -----------------------before printing----------------------
 ++
 ||
 ++
 ++

 ------------------------after printing-----------------------
 ========= 2019-07-10 11:13:40 =========
 ---------------------in function procces----------------------
 -----------------------before printing----------------------
 ++
 ||
 ++
 ++

 ------------------------after printing-----------------------
 ========= 2019-07-10 11:15:40 =========
 ---------------------in function procces----------------------
 -----------------------before printing----------------------
 ++
 ||
 ++
 ++

 ------------------------after printing-----------------------
 19/07/10 11:15:47 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
 19/07/10 11:15:47 WARN BlockManager: Block input-0-1562757347200 replicated to only 0 peer(s) instead of 1 peers

**This is when I start producing data to the Kafka topic:**

========= 2019-07-10 11:15:50 =========
 ---------------------in function procces----------------------
 -----------------------before printing----------------------
 19/07/10 11:15:52 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
 19/07/10 11:15:52 WARN BlockManager: Block input-0-1562757352200 replicated to only 0 peer(s) instead of 1 peers
 19/07/10 11:15:57 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
 19/07/10 11:15:57 WARN BlockManager: Block input-0-1562757357200 replicated to only 0 peer(s) instead of 1 peers
 ========= 2019-07-10 11:16:00 =========
 ---------------------in function procces----------------------
 -----------------------before printing----------------------
 19/07/10 11:16:02 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
 19/07/10 11:16:02 WARN BlockManager: Block input-0-1562757362200 replicated to only 0 peer(s) instead of 1 peers
 19/07/10 11:16:07 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
 19/07/10 11:16:07 WARN BlockManager: Block input-0-1562757367400 replicated to only 0 peer(s) instead of 1 peers
 ========= 2019-07-10 11:16:10 =========
 ---------------------in function procces----------------------
 -----------------------before printing----------------------
 19/07/10 11:16:12 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
 19/07/10 11:16:12 WARN BlockManager: Block input-0-1562757372400 replicated to only 0 peer(s) instead of 1 peers
 19/07/10 11:16:17 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
 19/07/10 11:16:17 WARN BlockManager: Block input-0-1562757377400 replicated to only 0 peer(s) instead of 1 peers

0 Answers:

No answers yet.