Spark Structured Streaming and filters

Time: 2017-07-31 08:49:24

Tags: spark-streaming

Spark 2.1: Structured Streaming with a plain count(*) or sum(field) over Parquet files works fine, but filtering does not. Sample code:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0.2.6.0.3-8
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :paste
// Entering paste mode (ctrl-D to finish)

import org.apache.spark.sql.types._

val userSchema = new StructType()
  .add("caseId", StringType)
  .add("ts", LongType)
  .add("rowtype", StringType)
  .add("rowordernumber", IntegerType)
  .add("parentrowordernumber", IntegerType)
  .add("fieldname", StringType)
  .add("valuestr", StringType)

val csvDF = spark.readStream.schema(userSchema).parquet("/folder1/folder2")

csvDF.createOrReplaceTempView("tmptable")
val aggDF = spark.sql("select count(*) from tmptable where rowtype='3600'")

aggDF
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()

aggDF
  .writeStream
  .queryName("aggregates")    // this query name will be the table name
  .outputMode("complete")
  .format("memory")
  .start()
spark.sql("select * from aggregates").show()


// Exiting paste mode, now interpreting.

+--------+
|count(1)|
+--------+
+--------+

import org.apache.spark.sql.types._
userSchema: org.apache.spark.sql.types.StructType = StructType(StructField(caseId,StringType,true), StructField(ts,LongType,true), StructField(rowtype,StringType,true), StructField(rowordernumber,IntegerType,true), StructField(parentrowordernumber,IntegerType,true), StructField(fieldname,StringType,true), StructField(valuestr,StringType,true))
csvDF: org.apache.spark.sql.DataFrame = [caseId: string, ts: bigint ... 5 more fields]
aggDF: org.apache.spark.sql.DataFrame = [count(1): bigint]

-------------------------------------------
Batch: 0
-------------------------------------------
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
+--------+
|count(1)|
+--------+
|       0|
+--------+

I also tried non-SQL style filtering: val aggDF = csvDF.filter("rowtype == '3600'").agg(count("caseId"))
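
For reference, the same filter written with typed column expressions would look like this (a sketch, not from the original post; it assumes the usual spark-shell imports):

import org.apache.spark.sql.functions.count
import spark.implicits._

// Equivalent DataFrame-API filter using a Column expression instead of a
// SQL string; $"rowtype" resolves the column by name at analysis time.
val aggDF = csvDF.filter($"rowtype" === "3600").agg(count("caseId"))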

Neither approach worked. I checked the Parquet files, and there are rows with rowtype = '3600':

[root@sandbox ~]# spark-sql
SPARK_MAJOR_VERSION is set to 2, using Spark2
spark-sql> select count(*) from tab1 where rowtype='3600' ;
433698463

2 Answers:

Answer 0 (score: 5):

When the data is static, you don't have to specify your own schema; Spark can determine the schema of the Parquet dataset on its own. E.g.:

case class Foo(lowercase: String, upperCase: String)
val df = spark.createDataset(List(Foo("abc","DEF"), Foo("ghi","JKL")))
df.write.parquet("/tmp/parquet-test")
val rdf = spark.read.parquet("/tmp/parquet-test")
rdf.printSchema
// root
//  |-- lowercase: string (nullable = true)
//  |-- upperCase: string (nullable = true)

At this point, a subsequent SQL query will disregard the case of column names:

rdf.createOrReplaceTempView("rdf")
spark.sql("select uppercase from rdf").collect
// Array[org.apache.spark.sql.Row] = Array([DEF], [JKL])

Spark has an option, spark.sql.caseSensitive, to enable/disable case sensitivity (the default is false), but it seems to only matter when writing.
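
For reference, a quick sketch (not part of the original answer) of how that option is toggled for a session:

// Hypothetical snippet: flip case-sensitive analysis on for this session,
// either via the runtime config API or the equivalent SET statement.
spark.conf.set("spark.sql.caseSensitive", "true")
spark.sql("set spark.sql.caseSensitive=true")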

Trying to do the same on a stream results in an exception:

java.lang.IllegalArgumentException: Schema must be specified when creating a streaming
  source DataFrame. If some files already exist in the directory, then depending
  on the file format you may be able to create a static DataFrame on that directory
  with 'spark.read.load(directory)' and infer schema from it.

This leaves you with the following options:

  1. Provide your own schema, as you did (but be aware that it is case sensitive).
  2. Follow the advice in the exception and derive the schema from the data already stored in the folder:

     val userSchema = spark.read.parquet("/tmp/parquet-test").schema
     val streamDf = spark.readStream.schema(userSchema).parquet("/tmp/parquet-test")

  3. Tell Spark to infer the schema anyway by setting spark.sql.streaming.schemaInference to true:

     spark.sql("set spark.sql.streaming.schemaInference=true")
     val streamDf = spark.readStream.parquet("/tmp/parquet-test")
     streamDf.createOrReplaceTempView("stream_rdf")
     val query = spark.sql("select uppercase, count(*) from stream_rdf group by uppercase")
       .writeStream
       .format("console")
       .outputMode("complete")
       .start
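Tying this back to the question's symptom, here is a minimal reproduction sketch (my own, under the assumption, consistent with both answers, that a user-supplied schema is matched case-sensitively against the Parquet files, so a mismatched column is read back as null):

import org.apache.spark.sql.types._
import spark.implicits._

// Write a file whose field is camelCase, then read it back through a
// schema that uses the lowercase name. Path /tmp/case-test is illustrative.
case class Rec(rowType: String)
spark.createDataset(List(Rec("3600"))).write.parquet("/tmp/case-test")

val wrongSchema = new StructType().add("rowtype", StringType)
val cnt = spark.read.schema(wrongSchema).parquet("/tmp/case-test")
  .filter("rowtype = '3600'").count
// cnt == 0: "rowtype" does not resolve against "rowType" in the files,
// so the column comes back null and the equality filter drops every row.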

Answer 1 (score: 0):

The problem was with the rowtype column name: the actual column name in the avro-parquet files is "rowType". The fix

.add("rowType", StringType)

solved the issue.
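
For completeness, the question's schema with the corrected column name would read as follows (derived from the code in the question; only one line changes):

import org.apache.spark.sql.types._

val userSchema = new StructType()
  .add("caseId", StringType)
  .add("ts", LongType)
  .add("rowType", StringType)   // was "rowtype": the case must match the files
  .add("rowordernumber", IntegerType)
  .add("parentrowordernumber", IntegerType)
  .add("fieldname", StringType)
  .add("valuestr", StringType)

The filter then becomes where rowType='3600' accordingly.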