Handling empty files in a PySpark Structured Streaming program

Date: 2019-11-15 19:33:09

Tags: pyspark spark-streaming

We have implemented a streaming data pipeline from a Kafka topic to HDFS files with the code below, and it runs successfully, but many empty files are being created in the HDFS target directory. We therefore tried to check whether the DataFrame is empty before writing to the target, but this does not work and fails with the error shown below. Please let us know how to handle an empty DataFrame in PySpark Structured Streaming (one possible alternative is sketched after the error message below).

Reference: How to check if spark dataframe is empty?


Target file path:

[filepath0@file101 ~]$ hadoop fs -ls /user/filepath0/pda/filepath1/tgt7
Found 16 items
drwxr-xr-x   - filepath0 hdfs          0 2019-11-13 16:53 /user/filepath0/pda/filepath1/tgt7/_spark_metadata
-rw-r--r--   3 filepath0 hdfs          0 2019-11-13 16:53 /user/filepath0/pda/filepath1/tgt7/part-00000-4b6a77d3-9821-4b61-befd-b49c448569ce.csv
-rw-r--r--   3 filepath0 hdfs     677646 2019-11-13 16:50 /user/filepath0/pda/filepath1/tgt7/part-00000-5d53fb08-2f42-4b5e-a3be-81d9c2ad4392.csv
-rw-r--r--   3 filepath0 hdfs      19679 2019-11-13 16:50 /user/filepath0/pda/filepath1/tgt7/part-00000-7bf19e7d-d174-4fac-a6c1-62c37017d151.csv
-rw-r--r--   3 filepath0 hdfs      20349 2019-11-13 16:50 /user/filepath0/pda/filepath1/tgt7/part-00000-7e820837-4b8d-4ce4-8643-c45e1f21f2b3.csv
-rw-r--r--   3 filepath0 hdfs          0 2019-11-13 16:53 /user/filepath0/pda/filepath1/tgt7/part-00000-be52de60-3d39-473c-81ff-dae3aefc3a6d.csv
-rw-r--r--   3 filepath0 hdfs      19679 2019-11-13 16:53 /user/filepath0/pda/filepath1/tgt7/part-00001-27d947ab-5ace-4abd-a17e-6af03080fdca.csv
-rw-r--r--   3 filepath0 hdfs          0 2019-11-13 16:50 /user/filepath0/pda/filepath1/tgt7/part-00001-43888a52-8a55-4fdd-bc79-faa3d43f2604.csv
-rw-r--r--   3 filepath0 hdfs          0 2019-11-13 16:50 /user/filepath0/pda/filepath1/tgt7/part-00001-4f0e7c26-2551-46e8-a721-9ab8b8b8d1c6.csv
-rw-r--r--   3 filepath0 hdfs     657707 2019-11-13 16:50 /user/filepath0/pda/filepath1/tgt7/part-00001-e8c294b4-6280-411e-bcc2-25b056c4e687.csv
-rw-r--r--   3 filepath0 hdfs      19879 2019-11-13 16:53 /user/filepath0/pda/filepath1/tgt7/part-00001-f06b80e0-920e-4c63-891b-e9546f79e09f.csv
-rw-r--r--   3 filepath0 hdfs          0 2019-11-13 16:50 /user/filepath0/pda/filepath1/tgt7/part-00002-2813bb17-ac47-4811-a63d-4de90eb535b2.csv
-rw-r--r--   3 filepath0 hdfs          0 2019-11-13 16:53 /user/filepath0/pda/filepath1/tgt7/part-00002-4219abc1-ff75-48b9-9b98-4a1ea4b9cd3d.csv
-rw-r--r--   3 filepath0 hdfs      19879 2019-11-13 16:50 /user/filepath0/pda/filepath1/tgt7/part-00002-76961eec-c125-4708-9816-c515ca8f5605.csv
-rw-r--r--   3 filepath0 hdfs      20349 2019-11-13 16:53 /user/filepath0/pda/filepath1/tgt7/part-00002-c2de8fa4-7707-4518-a8eb-61dc2af089ee.csv
-rw-r--r--   3 filepath0 hdfs     661457 2019-11-13 16:50 /user/filepath0/pda/filepath1/tgt7/part-00002-e383c97b-0ffc-438c-aef0-05b9ee401e7f.csv

PySpark code:

df = spark.readStream \
     .format("kafka") \
     .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS_CONS) \
     .option("subscribe", KAFKA_TOPIC_NAME_CONS) \
     .option("startingOffsets", "earliest") \
     .option("kafka.security.protocol","SASL_SSL")\
     .option("kafka.client.id" ,"Clinet_id")\
     .option("kafka.sasl.kerberos.service.name","kafka")\
     .option("kafka.ssl.truststore.location", "/home/path/kafka_trust.jks") \
     .option("kafka.ssl.truststore.password", "password_rd") \
     .option("kafka.sasl.kerberos.keytab","/home/path.keytab") \
     .option("kafka.sasl.kerberos.principal","path") \
     .load()
df1 = df.selectExpr( "CAST(value AS STRING)")

# Creating the writeStream DataFrame:

if (df1.rdd.isEmpty()) == 'False':
     df1.writeStream \
        .option("path","/targetpath/") \
        .format("csv") \
        .option("checkpointLocation","/checkpointpath/") \
        .outputMode("append") \
        .start()

Error message:

Traceback (most recent call last):
  File "/home/bdpda/pda/pysparkstructurestreaming1.py", line 43, in <module>
    if (df1.rdd.isEmpty()) == 'False':
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 84, in rdd
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: 'Queries with streaming sources must be executed with writeStream.start();;\nkafka'
19/11/15 14:12:37 INFO SparkContext: Invoking stop() from shutdown hook
19/11/15 14:12:37 INFO ServerConnector: Stopped Spark@70e4539a{HTTP/1.1}{0.0.0.0:4059}
19/11/15 14:12:37 INFO SparkUI: Stopped Spark web UI at http://10.220.3.167:4059
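
Note: rdd.isEmpty() is a batch action, and batch actions cannot be applied to a streaming DataFrame, which is why Spark raises the AnalysisException above. One possible way to perform the emptiness check is inside a foreachBatch sink, where each micro-batch is exposed as a static DataFrame. The sketch below illustrates this approach; it assumes Spark 2.4 or later (where foreachBatch is available in PySpark) and reuses the placeholder paths from the question, so it has not been verified against this exact cluster.

def write_non_empty_batch(batch_df, batch_id):
    # Inside foreachBatch, batch_df is a static DataFrame, so batch
    # actions such as rdd.isEmpty() are allowed here.
    if not batch_df.rdd.isEmpty():
        batch_df.write \
            .mode("append") \
            .format("csv") \
            .save("/targetpath/")

df1.writeStream \
   .foreachBatch(write_non_empty_batch) \
   .option("checkpointLocation", "/checkpointpath/") \
   .start()

Empty part files can also come from partitions that receive no data in a given micro-batch, so reducing the number of output partitions inside the batch function (for example with coalesce) is another option to consider.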

0 Answers:

No answers