Question

This is the exception I get whenever I try to convert it.

val df_col = df.select("ts.user.friends_count").collect.map(_.toSeq)
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;

All I want to do is replicate the following sql.dataframe operation in Structured Streaming.

df.collect().foreach(row => droolsCaseClass(row.getLong(0), row.getString(1)))

It works fine with DataFrames, but not in Structured Streaming.

Answer 1

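collect is a big no-no even in the RDD world of Spark Core, because of the amount of data you may transfer back to the driver's single JVM. It sets the boundary of Spark's benefits, because after collect you are in a single JVM.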

With that said, think about unbounded data that never terminates, i.e. a data stream. That is Spark Structured Streaming.

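A streaming Dataset is one that is never complete and whose contents change every time you ask for them, i.e. it is the result of executing a structured query over a stream of data.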

You simply cannot say "Hey, give me the data that is the content of a streaming Dataset". That does not even make sense.

That is why you cannot collect on a streaming Dataset. It is not possible as of Spark 2.2.1 (the latest version at the time of writing).

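If you want to receive the data that is inside a streaming Dataset for a period of time (called the batch interval in Spark Streaming, or the trigger in Spark Structured Streaming), write the result to a streaming sink, e.g. console.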
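To make the console sink concrete, here is a minimal sketch. It assumes a local Spark session and uses the built-in rate source (available from Spark 2.3, so newer than the 2.2.1 mentioned above) as a stand-in for the real stream. The object name, the trigger interval, and the 20-second run time are illustrative choices, not something from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object ConsoleSinkSketch {
  // Assumed trigger interval; pick whatever suits the workload.
  val triggerInterval = "5 seconds"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("console-sink-sketch")
      .getOrCreate()

    // Stand-in streaming source with columns `timestamp` and `value`.
    val df = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "1")
      .load()

    // Instead of collect, start a streaming query that prints each
    // micro-batch to the console once per trigger.
    val query = df.select("value")
      .writeStream
      .format("console")
      .outputMode("append")
      .trigger(Trigger.ProcessingTime(triggerInterval))
      .start()

    // Let the query run for about 20 seconds, then shut down.
    query.awaitTermination(20000L)
    query.stop()
    spark.stop()
  }
}
```

The console sink is meant for debugging and small volumes, since every batch is brought to the driver to be printed.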

You can also write your own custom streaming sink that does collect.map(_.toSeq) inside addBatch, which is the main and only method of a streaming sink. As a matter of fact, the console sink does exactly that.
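A custom sink along those lines might look roughly like the sketch below. It relies on Sink and StreamSinkProvider as they existed in the Spark 2.x line; Sink lives in an internal package (org.apache.spark.sql.execution.streaming), so treat this as an assumption about those versions and not as a stable public API. The class names CollectingSink and CollectingSinkProvider are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

// Hypothetical sink: collects every micro-batch on the driver.
class CollectingSink extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // Inside addBatch the batch is bounded, so collect is allowed here.
    val rows = data.collect().map(_.toSeq)
    rows.foreach(r => println(s"batch $batchId: $r"))
  }
}

// Hypothetical provider so the sink can be selected by class name.
class CollectingSinkProvider extends StreamSinkProvider {
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = new CollectingSink
}
```

It would be wired in with something like df.writeStream.format(classOf[CollectingSinkProvider].getName).option("checkpointLocation", "/some/dir").start(), where the checkpoint directory is a placeholder. Everything collected this way still ends up in the driver's single JVM, so the earlier warning about collect applies per batch.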

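"All I want to do is replicate the following sql.dataframe operation in Structured Streaming.
df.collect().foreach(row => droolsCaseClass(row.getLong(0), row.getString(1)))
It works fine with DataFrames, but not in Structured Streaming."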

The first solution that comes to mind is to use the foreach sink:

The foreach operation allows arbitrary operations to be computed on the output data.
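As a rough sketch of the foreach sink applied to the question's loop: the writer below builds one object per row, the way the original collect().foreach did. DroolsFact is a hypothetical stand-in for the question's droolsCaseClass, and the rate source (Spark 2.3+) with a cast timestamp stands in for the real stream so that column 0 is a Long and column 1 is a String. The temporary checkpoint directory is likewise only for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}

// Hypothetical stand-in for the question's droolsCaseClass.
case class DroolsFact(id: Long, label: String)

class DroolsWriter extends ForeachWriter[Row] {
  // Called once per partition and batch; return true to process its rows.
  override def open(partitionId: Long, version: Long): Boolean = true

  // Called for every row of the partition.
  override def process(row: Row): Unit = {
    val fact = DroolsFact(row.getLong(0), row.getString(1))
    println(fact) // replace with the real per-row work
  }

  // Called when the partition is done; errorOrNull is null on success.
  override def close(errorOrNull: Throwable): Unit = ()
}

object ForeachSinkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("foreach-sink-sketch")
      .getOrCreate()

    // Stand-in stream: `value` is a Long, `ts` is the timestamp as a String.
    val df = spark.readStream
      .format("rate")
      .load()
      .selectExpr("value", "CAST(timestamp AS STRING) AS ts")

    val query = df.writeStream
      .foreach(new DroolsWriter)
      .option("checkpointLocation",
        Files.createTempDirectory("foreach-sketch").toString)
      .start()

    query.awaitTermination(10000L)
    query.stop()
    spark.stop()
  }
}
```

Unlike collect, process runs on the executors and not on the driver, so anything the writer touches has to be serializable or created inside open.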

That does not, of course, mean this is the best solution. It is just the first one that came to mind.

How to execute df.rdd or df.collect().foreach on a streaming Dataset?
