At the beginning of the Structured Streaming Programming Guide there is a streaming word count example.
First, we run
nc -lk 8888
in another terminal. Next, following the Python code in the guide, we put this code in example.py:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
import sys
spark = SparkSession \
.builder \
.appName("StructuredNetworkWordCount") \
.getOrCreate()
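# Suppress all log output below FATAL so only the console sink tables are printed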
spark.sparkContext.setLogLevel("FATAL")
print("python version: "+sys.version)
print("spark version: "+str(spark.sparkContext.version))
# Create DataFrame representing the stream of input lines from connection to localhost:8888
lines = spark \
.readStream \
.format("socket") \
.option("host", "localhost") \
.option("port", 8888) \
.load()
# Split the lines into words
words = lines.select(
explode(
split(lines.value, " ")
).alias("word")
)
# Generate running word count
wordCounts = words.groupBy("word").count()
# Start running the query that prints the running counts to the console
query = wordCounts \
.writeStream \
.outputMode("complete") \
.format("console") \
.start()
query.awaitTermination()
We test the application with:
spark-submit example.py
The application starts and waits for data on the socket. In the terminal running netcat, we type single words, one at a time, each followed by a carriage return.
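For example, the input in the nc terminal might look like this (the actual words are arbitrary placeholders):
apache
spark
apache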
Every time we send data through netcat, the application fails. Here is some of the (abridged) output:
python version: 3.7.0 (default, Jul 23 2018, 20:22:55)
[Clang 9.1.0 (clang-902.0.39.2)]
spark version: 2.3.1
...
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/local/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o47.awaitTermination.
: org.apache.spark.sql.streaming.StreamingQueryException: null
=== Streaming Query ===
Identifier: [id = edbc0c22-2572-4036-82fd-b11afd030f26, runId = 16cbc842-3e20-4e43-9692-40ed09fd81e0]
Current Committed Offsets: {}
Current Available Offsets: {TextSocketSource[host: localhost, port: 8888]: 0}
Current State: ACTIVE
Thread State: RUNNABLE
Logical Plan:
Aggregate [word#4], [word#4, count(1) AS count#8L]
+- Project [word#4]
+- Generate explode(split(value#1, )), false, [word#4]
+- StreamingExecutionRelation TextSocketSource[host: localhost, port: 8888], [value#1]
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
Caused by: java.lang.IllegalArgumentException
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:449)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:432)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:103)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:432)
...
Strangely, the exact same example works for other members of the team with the same Python version (3.7.0) and the same Spark version (2.3.1).
Has anyone seen similar behavior?
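In case it helps with comparing the two environments: the trace dies inside Spark's ClosureCleaner while reading class files, so a JVM difference seems worth ruling out. A minimal sketch of an environment dump to run on both machines (assuming java is on the PATH; java prints its version to stderr):

import os
import subprocess
import sys

# Print the interpreter and JVM details that spark-submit will pick up
print("python:", sys.version)
print("JAVA_HOME:", os.environ.get("JAVA_HOME"))
subprocess.run(["java", "-version"])  # version string goes to stderr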