This is a long question, but I will try to describe the problem in detail.
I have a Spark application based on PySpark and the DStream API that receives data from Kafka. However, as shown in Spark's documentation, Kafka support for the DStream API has been removed from the latest versions of PySpark. I am therefore trying to migrate the application to Structured Streaming, but when the application receives data from Kafka I run into a performance problem while building the DataFrame.
The data is sent to Kafka as rows of CSV strings, and the application is responsible for receiving those rows and applying a (known) schema in order to process the data. The initial DataFrame has many Kafka-related columns, and the message payload sits in the "value" column, as described here.
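For context, this is roughly what the raw DataFrame coming from the Kafka source looks like before any processing (a sketch of the printSchema() output as I understand the Kafka source; the payload I actually care about is the binary value column, which I cast to a string later):
input_df.printSchema()
# root
#  |-- key: binary (nullable = true)
#  |-- value: binary (nullable = true)
#  |-- topic: string (nullable = true)
#  |-- partition: integer (nullable = true)
#  |-- offset: long (nullable = true)
#  |-- timestamp: timestamp (nullable = true)
#  |-- timestampType: integer (nullable = true)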
Scenario 1
Split the columns on the streaming DataFrame. This is the ideal solution, because it would let me use the built-in features of Structured Streaming, such as incremental aggregations, watermarking, and the other output modes (e.g. append). A short sketch of that kind of aggregation follows the example code below.
Example code:
from pyspark.sql import functions as F

def process_dataframe(df, batch_id):
    df.cache()
    count = df.count()
    print('Number of rows: {0}'.format(count))
    # continue df processing...

input_df = spark. \
    readStream. \
    format('kafka'). \
    option('kafka.bootstrap.servers', '127.0.0.1:9092'). \
    option('subscribe', 'some_topic_name'). \
    load()

# Split the CSV line into an array column, then project every array element
# into its own field column, directly on the streaming DataFrame.
streaming_df = input_df.selectExpr("CAST(value AS STRING)")
streaming_df = streaming_df.withColumn('temp', F.split('value', ','))
streaming_df = streaming_df.drop('value')
field_names = [('field' + str(idx)) for idx in range(1, 45)]
streaming_df = streaming_df. \
    select(*[F.col('temp').getItem(idx).alias('{0}'.format(column_name))
             for idx, column_name in enumerate(field_names)])
streaming_df.printSchema()

query = streaming_df. \
    writeStream. \
    trigger(processingTime='300 seconds'). \
    outputMode('update'). \
    foreachBatch(process_dataframe). \
    start()
query.awaitTermination()
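For completeness, this is the kind of processing I would like to be able to do on the split streaming DataFrame, and the reason I consider Scenario 1 the ideal one. The column names and window sizes below are made up purely for illustration; the point is only the watermark plus windowed-aggregation pattern in append mode:
# Hypothetical example: assume field1 carries an event-time timestamp and
# field2 a grouping key.
windowed_df = streaming_df. \
    withColumn('event_time', F.col('field1').cast('timestamp')). \
    withWatermark('event_time', '10 minutes'). \
    groupBy(F.window('event_time', '5 minutes'), F.col('field2')). \
    count()
agg_query = windowed_df. \
    writeStream. \
    outputMode('append'). \
    format('console'). \
    start()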
Compared to my previous DStream API application, the performance of this implementation is poor. With roughly 3.3 million rows arriving per 5-minute trigger (processing time), it takes 3 to 3.5 minutes just to process the data (only splitting the columns and counting the rows).
I tried to find the root cause and eventually looked at the physical plan generated by the optimizer. The final query appears to call cast and split once for every column of the incoming data. This is the physical plan:
== Physical Plan ==
*(1) Project [split(cast(value#746 as string), ,)[0] AS field1#27, split(cast(value#746 as string), ,)[1] AS field2#28, split(cast(value#746 as string), ,)[2] AS field3#29, split(cast(value#746 as string), ,)[3] AS field4#30, split(cast(value#746 as string), ,)[4] AS field5#31, split(cast(value#746 as string), ,)[5] AS field6#32, split(cast(value#746 as string), ,)[6] AS field7#33, split(cast(value#746 as string), ,)[7] AS field8#34, split(cast(value#746 as string), ,)[8] AS field9#35, split(cast(value#746 as string), ,)[9] AS field10#36, split(cast(value#746 as string), ,)[10] AS field11#37, split(cast(value#746 as string), ,)[11] AS field12#38, split(cast(value#746 as string), ,)[12] AS field13#39, split(cast(value#746 as string), ,)[13] AS field14#40, split(cast(value#746 as string), ,)[14] AS field15#41, split(cast(value#746 as string), ,)[15] AS field16#42, split(cast(value#746 as string), ,)[16] AS field17#43, split(cast(value#746 as string), ,)[17] AS field18#44, split(cast(value#746 as string), ,)[18] AS field19#45, split(cast(value#746 as string), ,)[19] AS field20#46, split(cast(value#746 as string), ,)[20] AS field21#47, split(cast(value#746 as string), ,)[21] AS field22#48, split(cast(value#746 as string), ,)[22] AS field23#49, split(cast(value#746 as string), ,)[23] AS field24#50, ... 20 more fields]
+- *(1) Project [key#745, value#746, topic#747, partition#748, offset#749L, timestamp#750, timestampType#751]
+- *(1) ScanV2 kafka[key#745, value#746, topic#747, partition#748, offset#749L, timestamp#750, timestampType#751] (Options: [subscribe=ip_flow_imsi_64_partitions_new_kafka_version,kafka.bootstrap.servers=127.0.0.1:90...)
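As a side note, a plan like the one above can also be inspected programmatically; a minimal sketch (explain() is a standard DataFrame method, and StreamingQuery exposes explain() for the last executed micro-batch):
def process_dataframe(df, batch_id):
    # Inside foreachBatch the micro-batch DataFrame is a regular DataFrame,
    # so its parsed/analyzed/optimized/physical plans can be printed directly.
    df.explain(True)
    # ...

# After the query has started, print the plan of the last executed micro-batch.
query.explain()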
To investigate the problem further, I implemented a second version, described in Scenario 2.
Scenario 2
Do not split the columns on the streaming DataFrame. Instead, pass the incoming rows through as plain comma-separated strings and split the columns on the (non-streaming) batch DataFrame inside foreachBatch.
Example code:
from pyspark.sql import functions as F

def process_dataframe(df, batch_id):
    # Split inside the batch function, on a regular (non-streaming) DataFrame,
    # and cache before projecting the individual field columns.
    df = df.withColumn('temp', F.split('value', ','))
    df = df.drop('value')
    df.cache()
    df.count()
    field_names = [('field' + str(idx)) for idx in range(1, 45)]
    df = df. \
        select(*[F.col('temp').getItem(idx).alias('{0}'.format(column_name))
                 for idx, column_name in enumerate(field_names)])
    df.cache()
    count = df.count()
    print('Number of rows: {0}'.format(count))
    # continue df processing...

input_df = spark. \
    readStream. \
    format('kafka'). \
    option('kafka.bootstrap.servers', '127.0.0.1:9092'). \
    option('subscribe', 'some_topic_name'). \
    load()

streaming_df = input_df.selectExpr("CAST(value AS STRING)")

query = streaming_df. \
    writeStream. \
    trigger(processingTime='300 seconds'). \
    outputMode('update'). \
    foreachBatch(process_dataframe). \
    start()
query.awaitTermination()
The trick here is that I cache right after calling split and only then start creating the individual columns. This approach performs much better: for the same number of rows, on the same machine, it takes about 31 seconds to process the DataFrame. The physical plan for this approach shows that split is called only once, because the DataFrame is cached right after the split:
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)], output=[count#413L])
+- Exchange SinglePartition
+- *(1) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#636L])
+- InMemoryTableScan
+- InMemoryRelation [field1#55, field2#56, field3#57, field4#58, field5#59, field6#60, field7#61, field8#62, field9#63, field10#64, field11#65, field12#66, field13#67, field14#68, field15#69, field16#70, field17#71, field18#72, field19#73, field20#74, field21#75, field22#76, field23#77, field24#78, ... 20 more fields], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Project [temp#35[0] AS field1#55, temp#35[1] AS field2#56, temp#35[2] AS field3#57, temp#35[3] AS field4#58, temp#35[4] AS field5#59, temp#35[5] AS field6#60, temp#35[6] AS field7#61, temp#35[7] AS field8#62, temp#35[8] AS field9#63, temp#35[9] AS field10#64, temp#35[10] AS field11#65, temp#35[11] AS field12#66, temp#35[12] AS field13#67, temp#35[13] AS field14#68, temp#35[14] AS field15#69, temp#35[15] AS field16#70, temp#35[16] AS field17#71, temp#35[17] AS field18#72, temp#35[18] AS field19#73, temp#35[19] AS field20#74, temp#35[20] AS field21#75, temp#35[21] AS field22#76, temp#35[22] AS field23#77, temp#35[23] AS field24#78, ... 20 more fields]
+- InMemoryTableScan [temp#35]
+- InMemoryRelation [temp#35], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Project [split(value#34, ,) AS temp#35]
+- *(1) SerializeFromObject [if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, value), StringType), true, false) AS value#34]
+- Scan[obj#33]
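One side note on Scenario 2 (an assumption about cache hygiene on my part, not something I have benchmarked as part of the 31 seconds): since process_dataframe caches a new DataFrame on every trigger, I assume it is safer to unpersist at the end of each batch so cached blocks do not accumulate across micro-batches:
def process_dataframe(df, batch_id):
    temp_df = df.withColumn('temp', F.split('value', ',')).drop('value')
    temp_df.cache()
    temp_df.count()
    # ... build the field columns and continue processing as above ...
    temp_df.unpersist()  # release the cached blocks before the next micro-batch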
So, my question is: what am I doing wrong in the first approach? Is my understanding correct that the repeated cast and split calls are the cause of this performance difference? And if so, given that I cannot call cache or persist on a streaming dataframe, is there a way to force the optimizer to call split only once?
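For reference, one alternative I have been looking at (assuming Spark 3.0+, where from_csv is available in PySpark; I have not verified whether the optimizer avoids re-parsing the line once per field with this approach) is to parse the whole CSV line into a single struct column and then expand it:
from pyspark.sql import functions as F

# Hypothetical schema: all 44 fields as strings; in practice each field would
# get its real type.
field_names = ['field' + str(idx) for idx in range(1, 45)]
csv_schema = ', '.join('{0} STRING'.format(name) for name in field_names)

parsed_df = input_df. \
    selectExpr("CAST(value AS STRING)"). \
    select(F.from_csv(F.col('value'), csv_schema).alias('csv')). \
    select('csv.*')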