I'm trying to do a left outer join between two Kafka streams using PySpark and Structured Streaming (Spark 2.3).
import os
import time
from pyspark.sql.types import *
from pyspark.sql.functions import from_json, col, struct, explode, get_json_object
from ast import literal_eval
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 pyspark-shell'
spark = SparkSession \
    .builder \
    .appName("Spark Kafka Structured Streaming") \
    .getOrCreate()
schema_impressions = StructType() \
    .add("id_req", StringType()) \
    .add("ts_imp_request", TimestampType()) \
    .add("country", StringType()) \
    .add("TS_IMPRESSION", TimestampType())

schema_requests = StructType() \
    .add("id_req", StringType()) \
    .add("page", StringType()) \
    .add("conntype", StringType()) \
    .add("TS_REQUEST", TimestampType())
impressions = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "ip-ec2.internal:9092") \
    .option("subscribe", "ssp.datascience_impressions") \
    .load()

requests = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "ip-ec2.internal:9092") \
    .option("subscribe", "ssp.datascience_requests") \
    .option("startingOffsets", "latest") \
    .load()
query_requests = requests \
    .select(col("timestamp"), col("key").cast("string"), from_json(col("value").cast("string"), schema_requests).alias("parsed")) \
    .select(col("timestamp").alias("timestamp_req"), "parsed.id_req", "parsed.page", "parsed.conntype", "parsed.TS_REQUEST") \
    .withWatermark("timestamp_req", "120 seconds")

query_impressions = impressions \
    .select(col("timestamp"), col("key").cast("string"), from_json(col("value").cast("string"), schema_impressions).alias("parsed")) \
    .select(col("timestamp").alias("timestamp_imp"), col("parsed.id_req").alias("id_imp"), "parsed.ts_imp_request", "parsed.country", "parsed.TS_IMPRESSION") \
    .withWatermark("timestamp_imp", "120 seconds")
query_requests.printSchema()
query_impressions.printSchema()
> root
|-- timestamp_req: timestamp (nullable = true)
|-- id_req: string (nullable = true)
|-- page: string (nullable = true)
|-- conntype: string (nullable = true)
|-- TS_REQUEST: timestamp (nullable = true)
>
> root
|-- timestamp_imp: timestamp (nullable = true)
|-- id_imp: string (nullable = true)
|-- ts_imp_request: timestamp (nullable = true)
|-- country: string (nullable = true)
|-- TS_IMPRESSION: timestamp (nullable = true)
To summarize: I read data from the two Kafka streams, and in the following lines I try to join them by ID.
rawQuery = query_requests.join(query_impressions, expr("""
    (id_req = id_imp AND
    timestamp_imp >= timestamp_req AND
    timestamp_imp <= timestamp_req + interval 5 minutes)
    """),
    "leftOuter")
rawQuery = rawQuery \
    .writeStream \
    .format("parquet") \
    .option("checkpointLocation", "/home/jovyan/streaming/applicationHistory") \
    .option("path", "/home/jovyan/streaming").start()
print(rawQuery.status)
{'message': 'Processing new data', 'isDataAvailable': True, 'isTriggerActive': True}
ERROR:root:Exception while sending command. Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/py4j/java_gateway.py", line 1062, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/py4j/java_gateway.py", line 908, in send_command
    response = connection.send_command(command)
  File "/opt/conda/lib/python3.6/site-packages/py4j/java_gateway.py", line 1067, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:33968)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<...>", line 3, in <module>
    print(rawQuery.status)
  File "/opt/conda/lib/python3.6/site-packages/pyspark/sql/streaming.py", line 114, in status
    return json.loads(self._jsq.status().json())
  File "/opt/conda/lib/python3.6/site-packages/py4j/java_gateway.py", line 1160, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/conda/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/conda/lib/python3.6/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o92.status

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 1828, in showtraceback
    stb = value._render_traceback_()
AttributeError: 'Py4JError' object has no attribute '_render_traceback_'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/py4j/java_gateway.py", line 852, in _get_connection
    connection = self.deque.pop()
IndexError: pop from an empty deque
I'm running Spark locally with a Jupyter Notebook. In spark/conf/spark-defaults.conf I have:
# Example:
# spark.master spark://master:7077
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 15g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
If I try to use Spark again after the previous error, I get this error:
ERROR:root:Exception while sending command. Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/py4j/java_gateway.py", line 1062, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/py4j/java_gateway.py", line 908, in send_command
    response = connection.send_command(command)
  File "/opt/conda/lib/python3.6/site-packages/py4j/java_gateway.py", line 1067, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
Answer (score: 1):
I solved the problem! Basically, for some reason, the issue was related to the Jupyter Notebook. I removed the following line from the code above:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 pyspark-shell'
Then I ran the code from the console instead:
> spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 spark_structured.py
That way, I was able to run all the code without any problems.
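One detail worth keeping in mind when moving from the notebook to spark-submit (this is an assumption about how spark_structured.py ends, since that part isn't shown here): a standalone script exits as soon as it reaches its last line, so the streaming query usually needs an awaitTermination() call at the end to keep the driver alive.

# Keep the driver alive so the streaming query keeps running under spark-submit.
rawQuery.awaitTermination()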
If you run into the same problem, you can also edit spark-defaults.conf and increase spark.driver.memory and spark.executor.memory.
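For reference, a minimal sketch of that change in spark-defaults.conf; the 15g / 8g values are only examples and should be tuned to your machine:

# Example memory settings for spark-defaults.conf (values are illustrative).
spark.driver.memory              15g
spark.executor.memory            8g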