We were able to implement a streaming data pipeline from a Kafka topic to HDFS files using the code below. However, many empty files are being created in the HDFS target. So we tried to check for an empty DataFrame before writing to the target, but the check does not work and throws the error below. Please let us know how to handle an empty DataFrame in PySpark Structured Streaming.
Reference: How to check if spark dataframe is empty?
Target file path:
[filepath0@file101 ~]$ hadoop fs -ls /user/filepath0/pda/filepath1/tgt7
Found 16 items
drwxr-xr-x - filepath0 hdfs 0 2019-11-13 16:53 /user/filepath0/pda/filepath1/tgt7/_spark_metadata
-rw-r--r-- 3 filepath0 hdfs 0 2019-11-13 16:53 /user/filepath0/pda/filepath1/tgt7/part-00000-4b6a77d3-9821-4b61-befd-b49c448569ce.csv
-rw-r--r-- 3 filepath0 hdfs 677646 2019-11-13 16:50 /user/filepath0/pda/filepath1/tgt7/part-00000-5d53fb08-2f42-4b5e-a3be-81d9c2ad4392.csv
-rw-r--r-- 3 filepath0 hdfs 19679 2019-11-13 16:50 /user/filepath0/pda/filepath1/tgt7/part-00000-7bf19e7d-d174-4fac-a6c1-62c37017d151.csv
-rw-r--r-- 3 filepath0 hdfs 20349 2019-11-13 16:50 /user/filepath0/pda/filepath1/tgt7/part-00000-7e820837-4b8d-4ce4-8643-c45e1f21f2b3.csv
-rw-r--r-- 3 filepath0 hdfs 0 2019-11-13 16:53 /user/filepath0/pda/filepath1/tgt7/part-00000-be52de60-3d39-473c-81ff-dae3aefc3a6d.csv
-rw-r--r-- 3 filepath0 hdfs 19679 2019-11-13 16:53 /user/filepath0/pda/filepath1/tgt7/part-00001-27d947ab-5ace-4abd-a17e-6af03080fdca.csv
-rw-r--r-- 3 filepath0 hdfs 0 2019-11-13 16:50 /user/filepath0/pda/filepath1/tgt7/part-00001-43888a52-8a55-4fdd-bc79-faa3d43f2604.csv
-rw-r--r-- 3 filepath0 hdfs 0 2019-11-13 16:50 /user/filepath0/pda/filepath1/tgt7/part-00001-4f0e7c26-2551-46e8-a721-9ab8b8b8d1c6.csv
-rw-r--r-- 3 filepath0 hdfs 657707 2019-11-13 16:50 /user/filepath0/pda/filepath1/tgt7/part-00001-e8c294b4-6280-411e-bcc2-25b056c4e687.csv
-rw-r--r-- 3 filepath0 hdfs 19879 2019-11-13 16:53 /user/filepath0/pda/filepath1/tgt7/part-00001-f06b80e0-920e-4c63-891b-e9546f79e09f.csv
-rw-r--r-- 3 filepath0 hdfs 0 2019-11-13 16:50 /user/filepath0/pda/filepath1/tgt7/part-00002-2813bb17-ac47-4811-a63d-4de90eb535b2.csv
-rw-r--r-- 3 filepath0 hdfs 0 2019-11-13 16:53 /user/filepath0/pda/filepath1/tgt7/part-00002-4219abc1-ff75-48b9-9b98-4a1ea4b9cd3d.csv
-rw-r--r-- 3 filepath0 hdfs 19879 2019-11-13 16:50 /user/filepath0/pda/filepath1/tgt7/part-00002-76961eec-c125-4708-9816-c515ca8f5605.csv
-rw-r--r-- 3 filepath0 hdfs 20349 2019-11-13 16:53 /user/filepath0/pda/filepath1/tgt7/part-00002-c2de8fa4-7707-4518-a8eb-61dc2af089ee.csv
-rw-r--r-- 3 filepath0 hdfs 661457 2019-11-13 16:50 /user/filepath0/pda/filepath1/tgt7/part-00002-e383c97b-0ffc-438c-aef0-05b9ee401e7f.csv
PySpark code:
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS_CONS) \
    .option("subscribe", KAFKA_TOPIC_NAME_CONS) \
    .option("startingOffsets", "earliest") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.client.id", "Clinet_id") \
    .option("kafka.sasl.kerberos.service.name", "kafka") \
    .option("kafka.ssl.truststore.location", "/home/path/kafka_trust.jks") \
    .option("kafka.ssl.truststore.password", "password_rd") \
    .option("kafka.sasl.kerberos.keytab", "/home/path.keytab") \
    .option("kafka.sasl.kerberos.principal", "path") \
    .load()

df1 = df.selectExpr("CAST(value AS STRING)")

# Creating the writeStream query (this emptiness check is what fails):
if (df1.rdd.isEmpty()) == 'False':
    df1.writeStream \
        .option("path", "/targetpath/") \
        .format("csv") \
        .option("checkpointLocation", "/checkpointpath/") \
        .outputMode("append") \
        .start()
Error message:
Traceback (most recent call last):
File "/home/bdpda/pda/pysparkstructurestreaming1.py", line 43, in <module>
if (df1.rdd.isEmpty()) == 'False':
File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 84, in rdd
File "/usr/hdp/2.6.1.0-129/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: 'Queries with streaming sources must be executed with writeStream.start();;\nkafka'
19/11/15 14:12:37 INFO SparkContext: Invoking stop() from shutdown hook
19/11/15 14:12:37 INFO ServerConnector: Stopped Spark@70e4539a{HTTP/1.1}{0.0.0.0:4059}
19/11/15 14:12:37 INFO SparkUI: Stopped Spark web UI at http://10.220.3.167:4059
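For context on the direction we are considering: calling .rdd materializes a streaming DataFrame, which is only allowed through writeStream.start(), hence the AnalysisException above. The only place an emptiness check appears to be legal is inside a per-batch callback. Below is a minimal, untested sketch assuming Spark 2.4+, where foreachBatch is available (our cluster is an older HDP 2.6.1 Spark build, so this may not apply there); the write_non_empty_batch name and both paths are placeholders:

def write_non_empty_batch(batch_df, batch_id):
    # Inside foreachBatch each micro-batch arrives as a *static* DataFrame,
    # so an emptiness check is legal here (note: a plain boolean test,
    # not a comparison against the string 'False').
    if not batch_df.rdd.isEmpty():
        batch_df.write \
            .mode("append") \
            .format("csv") \
            .save("/targetpath/")  # placeholder target path

df1.writeStream \
    .foreachBatch(write_non_empty_batch) \
    .option("checkpointLocation", "/checkpointpath/") \
    .start()

If this is the right direction, is foreachBatch (or an equivalent) available in the Spark version shipped with HDP 2.6.1, or is there another way to avoid writing empty part files per micro-batch?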