Question

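I am using a StreamingQueryListener to publish StreamingQueryProgress metrics to Kafka so that I can monitor the performance of my Spark Structured Streaming jobs. The metrics include:

numInputRecords: the number of records processed in a trigger
inputRowsPerSecond: the rate at which data arrives
processedRowsPerSecond: the rate at which Spark processes data
triggerExecution: the approximate time taken to process the micro-batch, in milliseconds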
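
For reference, a minimal sketch of a listener that extracts these four values might look like the following. It assumes Spark is on the classpath. `ProgressMetrics`, `ProgressMetricsListener`, and the `publish` callback are hypothetical names introduced for illustration (the Kafka producer itself is left out). In the Spark API the record count is exposed as `numInputRows`, and the trigger time is stored in the `durationMs` map under the key `triggerExecution`.

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{
  QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent
}

object ProgressMetrics {
  // Pure helper (hypothetical): renders one progress report as a CSV line.
  def formatLine(
      timestamp: String,
      numInputRows: Long,
      inputRowsPerSecond: Double,
      processedRowsPerSecond: Double,
      triggerExecutionMs: Long): String =
    s"$timestamp,$numInputRows,$inputRowsPerSecond,$processedRowsPerSecond,$triggerExecutionMs"
}

// `publish` is an assumed sink, for example a thin wrapper around a Kafka producer.
class ProgressMetricsListener(publish: String => Unit) extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    // durationMs is a java.util.Map, so a missing key comes back as null.
    val triggerMs =
      Option(p.durationMs.get("triggerExecution")).map(_.longValue).getOrElse(0L)
    publish(
      ProgressMetrics.formatLine(
        p.timestamp, p.numInputRows, p.inputRowsPerSecond, p.processedRowsPerSecond, triggerMs))
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
}
```

The listener would be registered with something like `spark.streams.addListener(new ProgressMetricsListener(line => println(line)))`, swapping `println` for the real publisher.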

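According to the Databricks blog post "Taking Apache Spark's Structured Streaming to Production", you can monitor performance by comparing inputRowsPerSecond with processedRowsPerSecond: if the latter is consistently lower than the former, the deployment needs to be scaled up to keep pace with the incoming data.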
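
As a rough sketch of that comparison: the blog post does not define "consistently", so the code below assumes it means "in each of the last `window` progress reports". `RateCheck`, `Rates`, and `fallingBehind` are hypothetical names, not part of Spark.

```scala
object RateCheck {
  // One progress report reduced to the two rates being compared.
  final case class Rates(inputRowsPerSecond: Double, processedRowsPerSecond: Double)

  // Assumption: "consistently" means every one of the last `window` reports
  // shows a processing rate below the input rate.
  def fallingBehind(history: Seq[Rates], window: Int): Boolean = {
    require(window > 0, "window must be positive")
    val recent = history.takeRight(window)
    // Not enough history yet counts as "not falling behind".
    recent.size == window &&
      recent.forall(r => r.processedRowsPerSecond < r.inputRowsPerSecond)
  }
}
```

The window length is a tuning choice: a short window reacts quickly but is easily tripped by a single bursty batch.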

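Below are the streaming query progress metrics for a tumbling-window Structured Streaming computation running on a single executor. I averaged processedRowsPerSecond and inputRowsPerSecond and confirmed that the compute resources are sufficient for the query to keep up.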

timestamp     num input rows   proc r/s   input r/s    execution ms
04:38:40.577               8        1.0         1.1            7725
04:38:48.312            6408      829.6       828.4            7724
04:38:56.047               8        1.0         1.0            7955
04:39:04.011              10        1.1         1.3            9088
04:39:13.110            6408      797.7       704.3            8033
04:39:21.154               8        1.0         1.0            8111
04:39:29.275            3760      480.6       463.0            7824
04:39:37.108            2656      332.7       339.1            7983
04:39:45.102               0        0.0         0.0               3
04:39:45.491               8        1.1       615.4            7601
04:39:53.102            6400      815.0       840.9            7853
04:40:00.965               8        0.9         1.0            8883
04:40:09.858              13        1.6         1.5            7969
04:40:17.836            6408      810.2       803.2            7909
04:40:25.755               8        1.1         1.0            7456
04:40:33.220            6400      808.6       857.3            7915
04:40:41.145               8        1.1         1.0            7482
04:40:48.636              13        1.7         1.7            7523
04:40:56.168               8        1.1         1.1            7561
04:41:03.739               0        0.0         0.0               3
Progress summary   total_input_rows=38540.0
                   avg_processed_rates=744.7164839858403 
                   avg_input_rates=739.0657377541836 
                   total_execution_ms=142595.0
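
The averages in the summary above are consistent with a mean weighted by the number of input rows per trigger (the question mentions this weighting further down). A minimal sketch of that calculation, using the first four rows of the table as sample data, could be written as follows. `WeightedRates` and `Row` are hypothetical names, and the exact code used to produce the summary is not shown in the question.

```scala
object WeightedRates {
  // One line of the progress table.
  final case class Row(
      numInputRows: Long,
      processedRowsPerSecond: Double,
      inputRowsPerSecond: Double)

  // Average of `rate` over all rows, weighting each row by its record count.
  def weightedAvg(rows: Seq[Row])(rate: Row => Double): Double = {
    val totalRows = rows.map(_.numInputRows).sum
    if (totalRows == 0L) 0.0
    else rows.map(r => r.numInputRows * rate(r)).sum / totalRows
  }

  // First four rows of the tumbling-window table above.
  val sample: Seq[Row] = Seq(
    Row(8, 1.0, 1.1),
    Row(6408, 829.6, 828.4),
    Row(8, 1.0, 1.0),
    Row(10, 1.1, 1.3)
  )

  def main(args: Array[String]): Unit = {
    println(weightedAvg(sample)(_.processedRowsPerSecond))
    println(weightedAvg(sample)(_.inputRowsPerSecond))
  }
}
```

Weighting by record count means the large batches dominate the result, so the many near-empty triggers barely move the average.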

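However, I think there are cases where this technique does not work. Below are the results for a simple non-windowed query that reads from an input topic and writes to an output topic after a 0.5 ms computation. In this case the average inputRowsPerSecond is about 5 times the average processedRowsPerSecond (even after weighting the intervals by numInputRecords), yet the query easily processed every record in the input stream within the test interval.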

timestamp     num input rows   proc r/s   input r/s    execution ms
01:58:41.456             326     1258.7     25076.9             259     
01:58:41.723            1745     1457.8      6535.6            1197     
01:58:42.929            4329     1235.8      3589.6            3503     
01:58:46.440               8       21.7         2.3             369     
01:58:46.817               0        0.0         0.0               3     
01:58:55.493               8      114.3       500.0              70     
01:58:55.572               0        0.0         0.0               3     
01:59:02.384              43      447.9      3307.7              96     
01:59:02.488             859     1403.6      8259.6             612     
01:59:03.108            4873     1153.6      7859.7            4224     
01:59:07.340             633      555.3       149.6            1140     
01:59:08.489               0        0.0         0.0             230     
01:59:15.490               8      121.2       666.7              66     
01:59:15.564               0        0.0         0.0               2     
01:59:23.217             136      925.2      9066.7             147     
01:59:23.372            1138     1493.4      7341.9             762     
01:59:24.143            5126      840.3      6648.5            6099     
01:59:30.250               8      135.6         1.3              59     
01:59:30.318               0        0.0         0.0               2     
01:59:35.485               4       67.8       333.3              59     
01:59:35.552               4       65.6        59.7              61     
01:59:35.622               0        0.0         0.0               2     
01:59:44.151              59      641.3      4916.7              92     
01:59:44.250             732     1402.3      7393.9             522     
01:59:44.780            4126     1605.4      7784.9            2570     
01:59:47.358            1491     1488.0       578.4            1002     
01:59:48.368               0        0.0         0.0               3     
01:59:55.489               8      131.1       666.7              61     
01:59:55.559               0        0.0         0.0               3     
02:00:05.008              77      445.1      5500.0             173     
02:00:05.190             994     1320.1      5461.5             753     
02:00:05.952            4533     1424.6      5948.8            3182     
02:00:09.142             804     1391.0       252.0             578     
02:00:09.728               0        0.0         0.0               2     
02:00:15.495               8       95.2       615.4              84     
02:00:15.588               0        0.0         0.0               3     
02:00:25.494               8       89.9       615.4              89     
02:00:25.592               0        0.0         0.0               4     
02:00:26.136               6       71.4       375.0              84     
02:00:26.228             718     1456.4      7804.3             493     
02:00:26.729            3843     1173.4      7670.7            3275     
02:00:30.013            1833     1229.4       558.2            1491     
02:00:31.512               0        0.0         0.0               3     
02:00:35.485               4       54.8       307.7              73     
02:00:35.567               4       60.6        48.8              66     
02:00:35.642               0        0.0         0.0               2     
02:00:45.494               8      101.3       666.7              79     
02:00:45.581               0        0.0         0.0               2     
02:00:55.494               8      112.7       615.4              71     
02:00:55.573               0        0.0         0.0               3     
Progress summary   total_input_rows=38512.0 
                   avg_processed_rates=1253.6837882613463 
                   avg_input_rates=6031.12344277613 
                   total_execution_ms=33461.0

Based on this progress data, I confirmed that:

processedRowsPerSecond = numInputRecords / triggerExecution * 1000 msec/sec
inputRowsPerSecond = numInputRecords / (current timestamp - previous timestamp) * 1000 msec/sec

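The two rates are comparable only when the triggerExecution time matches the interval between progress-record timestamps and those intervals are of consistent length. That is usually not the case for non-windowed queries.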
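
The two formulas can be checked against the second row of the non-windowed table (timestamp 01:58:41.723, 1745 rows, 1197 ms execution, previous timestamp 01:58:41.456). The sketch below is only an illustration of the arithmetic; `RateFormulas` and its method names are hypothetical.

```scala
import java.time.{Duration, LocalTime}

object RateFormulas {
  // Rows divided by the time spent executing the trigger.
  def processedRowsPerSecond(numInputRecords: Long, triggerExecutionMs: Long): Double =
    numInputRecords.toDouble / triggerExecutionMs * 1000.0

  // Rows divided by the gap between this progress timestamp and the previous one.
  def inputRowsPerSecond(numInputRecords: Long, previous: LocalTime, current: LocalTime): Double = {
    val elapsedMs = Duration.between(previous, current).toMillis
    numInputRecords.toDouble / elapsedMs * 1000.0
  }

  def main(args: Array[String]): Unit = {
    val previous = LocalTime.parse("01:58:41.456")
    val current  = LocalTime.parse("01:58:41.723")
    println(processedRowsPerSecond(1745, 1197))          // table reports 1457.8
    println(inputRowsPerSecond(1745, previous, current)) // table reports 6535.6
  }
}
```

For this row the denominators are 1197 ms and 267 ms, so the same 1745 records yield rates that differ by more than a factor of four even though nothing is backing up.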

Is there a recommended way to monitor non-windowed queries?

Answer 1

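This complication (which affects non-windowed Spark Structured Streaming jobs) appears to go away if you identify the column that represents your event time, e.g.: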

// Declare "time" as the event-time column, tolerating data up to 60 seconds late.
val inputData = dataSource.toDecodedDs(spark)
  .withWatermark("time", "60 seconds")

How to use Spark StreamingQueryProgress to accurately monitor performance?
