At the beginning of the Structured Streaming Programming Guide there is a streaming word count example.
First, we run
nc -lk 8888
in another terminal. Next, following the Python code in the guide, we put this code in example.py:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
import sys
spark = SparkSession \
.builder \
.appName("StructuredNetworkWordCount") \
.getOrCreate()
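# Suppress all log output below FATAL so only the console sink tables are printed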
spark.sparkContext.setLogLevel("FATAL")
print("python version: "+sys.version)
print("spark version: "+str(spark.sparkContext.version))
# Create DataFrame representing the stream of input lines from connection to localhost:8888
lines = spark \
.readStream \
.format("socket") \
.option("host", "localhost") \
.option("port", 8888) \
.load()
# Split the lines into words
words = lines.select(
explode(
split(lines.value, " ")
).alias("word")
)
# Generate running word count
wordCounts = words.groupBy("word").count()
# Start running the query that prints the running counts to the console
query = wordCounts \
.writeStream \
.outputMode("complete") \
.format("console") \
.start()
query.awaitTermination()
We test the application with:
spark-submit example.py
The application starts and waits for data on the socket. In the terminal running netcat, we type single words, one at a time, each followed by a carriage return.
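For example, the input in the nc terminal might look like this (the actual words are arbitrary placeholders):
apache
spark
apache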
Every time we send data through netcat, the application fails. Here is some of the (abridged) output:
python version: 3.7.0 (default, Jul 23 2018, 20:22:55)
[Clang 9.1.0 (clang-902.0.39.2)]
spark version: 2.3.1
...
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/local/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o47.awaitTermination.
: org.apache.spark.sql.streaming.StreamingQueryException: null
=== Streaming Query ===
Identifier: [id = edbc0c22-2572-4036-82fd-b11afd030f26, runId = 16cbc842-3e20-4e43-9692-40ed09fd81e0]
Current Committed Offsets: {}
Current Available Offsets: {TextSocketSource[host: localhost, port: 8888]: 0}
Current State: ACTIVE
Thread State: RUNNABLE
Logical Plan:
Aggregate [word#4], [word#4, count(1) AS count#8L]
+- Project [word#4]
+- Generate explode(split(value#1, )), false, [word#4]
+- StreamingExecutionRelation TextSocketSource[host: localhost, port: 8888], [value#1]
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
Caused by: java.lang.IllegalArgumentException
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:449)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:432)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:103)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:432)
...
Strangely, the exact same example works for other members of the team with the same Python version (3.7.0) and the same Spark version (2.3.1).
Has anyone seen similar behavior?
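In case it helps with comparing the two environments: the trace dies inside Spark's ClosureCleaner while reading class files, so a JVM difference seems worth ruling out. A minimal sketch of an environment dump to run on both machines (assuming java is on the PATH; java prints its version to stderr):

import os
import subprocess
import sys

# Print the interpreter and JVM details that spark-submit will pick up
print("python:", sys.version)
print("JAVA_HOME:", os.environ.get("JAVA_HOME"))
subprocess.run(["java", "-version"])  # version string goes to stderr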