I am using a StreamingQueryListener to publish StreamingQueryProgress metrics to Kafka so that I can monitor the performance of my Spark Structured Streaming jobs. The metrics include inputRowsPerSecond and processedRowsPerSecond.
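For context, here is a minimal sketch of such a listener, assuming a pre-configured Kafka producer; the topic name "query-metrics" is a placeholder:

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

class ProgressToKafkaListener(producer: KafkaProducer[String, String]) extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // StreamingQueryProgress.json serializes the whole progress record,
    // including numInputRows, inputRowsPerSecond and processedRowsPerSecond.
    producer.send(new ProducerRecord("query-metrics", event.progress.id.toString, event.progress.json))
  }
}
// Registered with: spark.streams.addListener(new ProgressToKafkaListener(producer))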
According to the Databricks blog post Taking Apache Spark's Structured Streaming to Production, you can monitor performance by comparing inputRowsPerSecond with processedRowsPerSecond: if the latter is consistently lower than the former, the deployment needs to be scaled up to keep pace with the incoming data.
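A minimal sketch of that rule as a check, assuming we only flag a problem when every record in a recent window of progress updates shows the shortfall:

import org.apache.spark.sql.streaming.StreamingQueryProgress

// True when every one of the recent progress records shows the processing
// rate falling below the input rate, i.e. the query is persistently behind.
def isFallingBehind(recent: Seq[StreamingQueryProgress]): Boolean =
  recent.nonEmpty && recent.forall(p => p.processedRowsPerSecond < p.inputRowsPerSecond)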
Below are the streaming query progress metrics for a tumbling-window Structured Streaming computation running on a single executor. I computed the averages of processedRowsPerSecond and inputRowsPerSecond and confirmed that the compute resources were sufficient to keep up with the query.
timestamp     numInputRows  proc r/s  input r/s  triggerExecution ms
04:38:40.577 8 1.0 1.1 7725
04:38:48.312 6408 829.6 828.4 7724
04:38:56.047 8 1.0 1.0 7955
04:39:04.011 10 1.1 1.3 9088
04:39:13.110 6408 797.7 704.3 8033
04:39:21.154 8 1.0 1.0 8111
04:39:29.275 3760 480.6 463.0 7824
04:39:37.108 2656 332.7 339.1 7983
04:39:45.102 0 0.0 0.0 3
04:39:45.491 8 1.1 615.4 7601
04:39:53.102 6400 815.0 840.9 7853
04:40:00.965 8 0.9 1.0 8883
04:40:09.858 13 1.6 1.5 7969
04:40:17.836 6408 810.2 803.2 7909
04:40:25.755 8 1.1 1.0 7456
04:40:33.220 6400 808.6 857.3 7915
04:40:41.145 8 1.1 1.0 7482
04:40:48.636 13 1.7 1.7 7523
04:40:56.168 8 1.1 1.1 7561
04:41:03.739 0 0.0 0.0 3
Progress summary total_input_rows=38540.0
avg_processed_rates=744.7164839858403
avg_input_rates=739.0657377541836
total_execution_ms=142595.0
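For reference, I weight the averages by numInputRows so that the many near-empty triggers do not dominate; a sketch of that computation over the collected progress records:

import org.apache.spark.sql.streaming.StreamingQueryProgress

// Averages of the per-trigger rates, weighted by each trigger's numInputRows.
def weightedAvgRates(progresses: Seq[StreamingQueryProgress]): (Double, Double) = {
  val totalRows = progresses.map(_.numInputRows).sum.toDouble
  val avgProcessed = progresses.map(p => p.processedRowsPerSecond * p.numInputRows).sum / totalRows
  val avgInput = progresses.map(p => p.inputRowsPerSecond * p.numInputRows).sum / totalRows
  (avgProcessed, avgInput)
}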
However, I think this technique does not work in some cases. Below is the progress data for a simple non-windowed query that reads from an input topic and, after about 0.5 ms of computation, writes to an output topic. In this case the average inputRowsPerSecond is five times higher than the average processedRowsPerSecond (even with each interval weighted by numInputRows), yet the query easily processed all of the records in the input stream within the test interval.
timestamp     numInputRows  proc r/s  input r/s  triggerExecution ms
01:58:41.456 326 1258.7 25076.9 259
01:58:41.723 1745 1457.8 6535.6 1197
01:58:42.929 4329 1235.8 3589.6 3503
01:58:46.440 8 21.7 2.3 369
01:58:46.817 0 0.0 0.0 3
01:58:55.493 8 114.3 500.0 70
01:58:55.572 0 0.0 0.0 3
01:59:02.384 43 447.9 3307.7 96
01:59:02.488 859 1403.6 8259.6 612
01:59:03.108 4873 1153.6 7859.7 4224
01:59:07.340 633 555.3 149.6 1140
01:59:08.489 0 0.0 0.0 230
01:59:15.490 8 121.2 666.7 66
01:59:15.564 0 0.0 0.0 2
01:59:23.217 136 925.2 9066.7 147
01:59:23.372 1138 1493.4 7341.9 762
01:59:24.143 5126 840.3 6648.5 6099
01:59:30.250 8 135.6 1.3 59
01:59:30.318 0 0.0 0.0 2
01:59:35.485 4 67.8 333.3 59
01:59:35.552 4 65.6 59.7 61
01:59:35.622 0 0.0 0.0 2
01:59:44.151 59 641.3 4916.7 92
01:59:44.250 732 1402.3 7393.9 522
01:59:44.780 4126 1605.4 7784.9 2570
01:59:47.358 1491 1488.0 578.4 1002
01:59:48.368 0 0.0 0.0 3
01:59:55.489 8 131.1 666.7 61
01:59:55.559 0 0.0 0.0 3
02:00:05.008 77 445.1 5500.0 173
02:00:05.190 994 1320.1 5461.5 753
02:00:05.952 4533 1424.6 5948.8 3182
02:00:09.142 804 1391.0 252.0 578
02:00:09.728 0 0.0 0.0 2
02:00:15.495 8 95.2 615.4 84
02:00:15.588 0 0.0 0.0 3
02:00:25.494 8 89.9 615.4 89
02:00:25.592 0 0.0 0.0 4
02:00:26.136 6 71.4 375.0 84
02:00:26.228 718 1456.4 7804.3 493
02:00:26.729 3843 1173.4 7670.7 3275
02:00:30.013 1833 1229.4 558.2 1491
02:00:31.512 0 0.0 0.0 3
02:00:35.485 4 54.8 307.7 73
02:00:35.567 4 60.6 48.8 66
02:00:35.642 0 0.0 0.0 2
02:00:45.494 8 101.3 666.7 79
02:00:45.581 0 0.0 0.0 2
02:00:55.494 8 112.7 615.4 71
02:00:55.573 0 0.0 0.0 3
Progress summary total_input_rows=38512.0
avg_processed_rates=1253.6837882613463
avg_input_rates=6031.12344277613
total_execution_ms=33461.0
From this progress data I have confirmed that:
processedRowsPerSecond = numInputRows / triggerExecution * 1000 ms/sec
inputRowsPerSecond = numInputRows / (current timestamp - previous timestamp) * 1000 ms/sec
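As a check, take the 01:58:41.723 row from the second trace: 1745 rows / 1.197 s of triggerExecution ≈ 1457.8 processed rows/sec, while 1745 rows / (01:58:41.723 - 01:58:41.456) = 1745 / 0.267 s ≈ 6535.6 input rows/sec, matching the reported values.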
These two rates are only comparable when the triggerExecution time matches the time between consecutive progress-record timestamps, and when those intervals are of consistent length. Neither is generally true for non-windowed queries.
Is there a recommended way to monitor non-windowed queries?
Answer 0 (score: 0)
This complication (which affects non-windowed Spark Structured Streaming jobs) seems to go away if you designate a column that represents your event time, e.g.:
// "time" is the event-time column of the decoded dataset; declaring it with
// a 60-second watermark gives the query an event-time notion of progress.
val inputData = dataSource.toDecodedDs(spark)
  .withWatermark("time", "60 seconds")
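For completeness, a sketch of where that line could sit in a full pipeline, assuming a Kafka source; the bootstrap servers, topic name, and the derivation of the "time" column are placeholders standing in for dataSource.toDecodedDs:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("watermarked-stream").getOrCreate()

// Read from Kafka and expose the record timestamp as the event-time column.
val inputData = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
  .option("subscribe", "input-topic")                  // placeholder
  .load()
  .selectExpr("CAST(value AS STRING) AS value", "timestamp AS time")
  .withWatermark("time", "60 seconds")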