This is a long question, but I will try to describe the problem in detail.
I have a Spark application based on PySpark and the DStream API that receives data from Kafka. However, as shown in Spark's documentation, Kafka support for the DStream API has been removed from the latest versions of PySpark. I am therefore trying to migrate the application to Structured Streaming, but when the application receives data from Kafka I run into a performance problem while building the DataFrame.
The data is sent to Kafka as rows of CSV strings, and the application is responsible for receiving those rows and applying a (known) schema in order to process the data. The initial DataFrame has many Kafka-related columns, and the message payload sits in the "value" column, as described here.
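For context, this is roughly what the raw DataFrame coming from the Kafka source looks like before any processing (a sketch of the printSchema() output as I understand the Kafka source; the payload I actually care about is the binary value column, which I cast to a string later):
input_df.printSchema()
# root
#  |-- key: binary (nullable = true)
#  |-- value: binary (nullable = true)
#  |-- topic: string (nullable = true)
#  |-- partition: integer (nullable = true)
#  |-- offset: long (nullable = true)
#  |-- timestamp: timestamp (nullable = true)
#  |-- timestampType: integer (nullable = true)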
Scenario 1
Split the columns on the streaming DataFrame. This is the ideal solution, because it would let me use the built-in features of Structured Streaming, such as incremental aggregations, watermarking, and the other output modes (e.g. append). A short sketch of that kind of aggregation follows the example code below.
Example code:
from pyspark.sql import functions as F

def process_dataframe(df, batch_id):
    df.cache()
    count = df.count()
    print('Number of rows: {0}'.format(count))
    # continue df processing...

input_df = spark. \
    readStream. \
    format('kafka'). \
    option('kafka.bootstrap.servers', '127.0.0.1:9092'). \
    option('subscribe', 'some_topic_name'). \
    load()

# Split the CSV line into an array column, then project every array element
# into its own field column, directly on the streaming DataFrame.
streaming_df = input_df.selectExpr("CAST(value AS STRING)")
streaming_df = streaming_df.withColumn('temp', F.split('value', ','))
streaming_df = streaming_df.drop('value')
field_names = [('field' + str(idx)) for idx in range(1, 45)]
streaming_df = streaming_df. \
    select(*[F.col('temp').getItem(idx).alias('{0}'.format(column_name))
             for idx, column_name in enumerate(field_names)])
streaming_df.printSchema()

query = streaming_df. \
    writeStream. \
    trigger(processingTime='300 seconds'). \
    outputMode('update'). \
    foreachBatch(process_dataframe). \
    start()
query.awaitTermination()
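For completeness, this is the kind of processing I would like to be able to do on the split streaming DataFrame, and the reason I consider Scenario 1 the ideal one. The column names and window sizes below are made up purely for illustration; the point is only the watermark plus windowed-aggregation pattern in append mode:
# Hypothetical example: assume field1 carries an event-time timestamp and
# field2 a grouping key.
windowed_df = streaming_df. \
    withColumn('event_time', F.col('field1').cast('timestamp')). \
    withWatermark('event_time', '10 minutes'). \
    groupBy(F.window('event_time', '5 minutes'), F.col('field2')). \
    count()
agg_query = windowed_df. \
    writeStream. \
    outputMode('append'). \
    format('console'). \
    start()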
Compared to my previous DStream API application, the performance of this implementation is poor. With roughly 3.3 million rows arriving per 5-minute trigger (processing time), it takes 3 to 3.5 minutes just to process the data (only splitting the columns and counting the rows).
I tried to find the root cause and eventually looked at the physical plan generated by the optimizer. The final query appears to call cast and split once for every column of the incoming data. This is the physical plan:
== Physical Plan ==
*(1) Project [split(cast(value#746 as string), ,)[0] AS field1#27, split(cast(value#746 as string), ,)[1] AS field2#28, split(cast(value#746 as string), ,)[2] AS field3#29, split(cast(value#746 as string), ,)[3] AS field4#30, split(cast(value#746 as string), ,)[4] AS field5#31, split(cast(value#746 as string), ,)[5] AS field6#32, split(cast(value#746 as string), ,)[6] AS field7#33, split(cast(value#746 as string), ,)[7] AS field8#34, split(cast(value#746 as string), ,)[8] AS field9#35, split(cast(value#746 as string), ,)[9] AS field10#36, split(cast(value#746 as string), ,)[10] AS field11#37, split(cast(value#746 as string), ,)[11] AS field12#38, split(cast(value#746 as string), ,)[12] AS field13#39, split(cast(value#746 as string), ,)[13] AS field14#40, split(cast(value#746 as string), ,)[14] AS field15#41, split(cast(value#746 as string), ,)[15] AS field16#42, split(cast(value#746 as string), ,)[16] AS field17#43, split(cast(value#746 as string), ,)[17] AS field18#44, split(cast(value#746 as string), ,)[18] AS field19#45, split(cast(value#746 as string), ,)[19] AS field20#46, split(cast(value#746 as string), ,)[20] AS field21#47, split(cast(value#746 as string), ,)[21] AS field22#48, split(cast(value#746 as string), ,)[22] AS field23#49, split(cast(value#746 as string), ,)[23] AS field24#50, ... 20 more fields]
+- *(1) Project [key#745, value#746, topic#747, partition#748, offset#749L, timestamp#750, timestampType#751]
+- *(1) ScanV2 kafka[key#745, value#746, topic#747, partition#748, offset#749L, timestamp#750, timestampType#751] (Options: [subscribe=ip_flow_imsi_64_partitions_new_kafka_version,kafka.bootstrap.servers=127.0.0.1:90...)
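As a side note, a plan like the one above can also be inspected programmatically; a minimal sketch (explain() is a standard DataFrame method, and StreamingQuery exposes explain() for the last executed micro-batch):
def process_dataframe(df, batch_id):
    # Inside foreachBatch the micro-batch DataFrame is a regular DataFrame,
    # so its parsed/analyzed/optimized/physical plans can be printed directly.
    df.explain(True)
    # ...

# After the query has started, print the plan of the last executed micro-batch.
query.explain()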
To investigate the problem further, I implemented a second version, described in Scenario 2.
Scenario 2
Do not split the columns on the streaming DataFrame. Instead, pass the incoming rows through as plain comma-separated strings and split the columns on the (non-streaming) batch DataFrame inside foreachBatch.
Example code:
from pyspark.sql import functions as F

def process_dataframe(df, batch_id):
    # Split inside the batch function, on a regular (non-streaming) DataFrame,
    # and cache before projecting the individual field columns.
    df = df.withColumn('temp', F.split('value', ','))
    df = df.drop('value')
    df.cache()
    df.count()
    field_names = [('field' + str(idx)) for idx in range(1, 45)]
    df = df. \
        select(*[F.col('temp').getItem(idx).alias('{0}'.format(column_name))
                 for idx, column_name in enumerate(field_names)])
    df.cache()
    count = df.count()
    print('Number of rows: {0}'.format(count))
    # continue df processing...

input_df = spark. \
    readStream. \
    format('kafka'). \
    option('kafka.bootstrap.servers', '127.0.0.1:9092'). \
    option('subscribe', 'some_topic_name'). \
    load()

streaming_df = input_df.selectExpr("CAST(value AS STRING)")

query = streaming_df. \
    writeStream. \
    trigger(processingTime='300 seconds'). \
    outputMode('update'). \
    foreachBatch(process_dataframe). \
    start()
query.awaitTermination()
The trick here is that I cache right after calling split and only then start creating the individual columns. This approach performs much better: for the same number of rows, on the same machine, it takes about 31 seconds to process the DataFrame. The physical plan for this approach shows that split is called only once, because the DataFrame is cached right after the split:
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)], output=[count#413L])
+- Exchange SinglePartition
+- *(1) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#636L])
+- InMemoryTableScan
+- InMemoryRelation [field1#55, field2#56, field3#57, field4#58, field5#59, field6#60, field7#61, field8#62, field9#63, field10#64, field11#65, field12#66, field13#67, field14#68, field15#69, field16#70, field17#71, field18#72, field19#73, field20#74, field21#75, field22#76, field23#77, field24#78, ... 20 more fields], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Project [temp#35[0] AS field1#55, temp#35[1] AS field2#56, temp#35[2] AS field3#57, temp#35[3] AS field4#58, temp#35[4] AS field5#59, temp#35[5] AS field6#60, temp#35[6] AS field7#61, temp#35[7] AS field8#62, temp#35[8] AS field9#63, temp#35[9] AS field10#64, temp#35[10] AS field11#65, temp#35[11] AS field12#66, temp#35[12] AS field13#67, temp#35[13] AS field14#68, temp#35[14] AS field15#69, temp#35[15] AS field16#70, temp#35[16] AS field17#71, temp#35[17] AS field18#72, temp#35[18] AS field19#73, temp#35[19] AS field20#74, temp#35[20] AS field21#75, temp#35[21] AS field22#76, temp#35[22] AS field23#77, temp#35[23] AS field24#78, ... 20 more fields]
+- InMemoryTableScan [temp#35]
+- InMemoryRelation [temp#35], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Project [split(value#34, ,) AS temp#35]
+- *(1) SerializeFromObject [if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, value), StringType), true, false) AS value#34]
+- Scan[obj#33]
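One side note on Scenario 2 (an assumption about cache hygiene on my part, not something I have benchmarked as part of the 31 seconds): since process_dataframe caches a new DataFrame on every trigger, I assume it is safer to unpersist at the end of each batch so cached blocks do not accumulate across micro-batches:
def process_dataframe(df, batch_id):
    temp_df = df.withColumn('temp', F.split('value', ',')).drop('value')
    temp_df.cache()
    temp_df.count()
    # ... build the field columns and continue processing as above ...
    temp_df.unpersist()  # release the cached blocks before the next micro-batch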
So, my question is: what am I doing wrong in the first approach? Is my understanding correct that the repeated cast and split calls are the cause of this performance difference? And if so, given that I cannot call cache or persist on a streaming dataframe, is there a way to force the optimizer to call split only once?
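For reference, one alternative I have been looking at (assuming Spark 3.0+, where from_csv is available in PySpark; I have not verified whether the optimizer avoids re-parsing the line once per field with this approach) is to parse the whole CSV line into a single struct column and then expand it:
from pyspark.sql import functions as F

# Hypothetical schema: all 44 fields as strings; in practice each field would
# get its real type.
field_names = ['field' + str(idx) for idx in range(1, 45)]
csv_schema = ', '.join('{0} STRING'.format(name) for name in field_names)

parsed_df = input_df. \
    selectExpr("CAST(value AS STRING)"). \
    select(F.from_csv(F.col('value'), csv_schema).alias('csv')). \
    select('csv.*')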