Spark 2.1 Structured Streaming with a raw count(*) and sum(field) works fine on parquet files, but filtering does not work. Sample code:
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0.2.6.0.3-8
/_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.
scala> :paste
// Entering paste mode (ctrl-D to finish)
import org.apache.spark.sql.types._
val userSchema = new StructType()
  .add("caseId", StringType)
  .add("ts", LongType)
  .add("rowtype", StringType)
  .add("rowordernumber", IntegerType)
  .add("parentrowordernumber", IntegerType)
  .add("fieldname", StringType)
  .add("valuestr", StringType)
val csvDF = spark.readStream.schema(userSchema).parquet("/folder1/folder2")
csvDF.createOrReplaceTempView("tmptable")
val aggDF = spark.sql("select count(*) from tmptable where rowtype='3600'")
aggDF
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()
aggDF
  .writeStream
  .queryName("aggregates") // this query name will be the table name
  .outputMode("complete")
  .format("memory")
  .start()
spark.sql("select * from aggregates").show()
// Exiting paste mode, now interpreting.
+--------+
|count(1)|
+--------+
+--------+
import org.apache.spark.sql.types._
userSchema: org.apache.spark.sql.types.StructType = StructType(StructField(caseId,StringType,true), StructField(ts,LongType,true), StructField(rowtype,StringType,true), StructField(rowordernumber,IntegerType,true), StructField(parentrowordernumber,IntegerType,true), StructField(fieldname,StringType,true), StructField(valuestr,StringType,true))
csvDF: org.apache.spark.sql.DataFrame = [caseId: string, ts: bigint ... 5 more fields]
aggDF: org.apache.spark.sql.DataFrame = [count(1): bigint]
-------------------------------------------
Batch: 0
-------------------------------------------
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
+--------+
|count(1)|
+--------+
| 0|
+--------+
Additionally, I tried non-SQL-style filtering:
val aggDF = csvDF.filter("rowtype == '3600'").agg(count("caseId"))
with no success. I checked the parquet files, and there are rows with rowtype = '3600':
[root@sandbox ~]# spark-sql
SPARK_MAJOR_VERSION is set to 2, using Spark2
spark-sql> select count(*) from tab1 where rowtype='3600' ;
433698463
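Written out in full, that DataFrame-API attempt would look like this (a sketch assuming the same csvDF stream defined above; count comes from org.apache.spark.sql.functions):
import org.apache.spark.sql.functions.count
val aggDF = csvDF
  .filter("rowtype == '3600'")
  .agg(count("caseId"))
aggDF
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()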
Answer 0 (score: 5)
When the data is static, you don't need to specify your own schema; Spark can determine the schema of the parquet dataset by itself. E.g.:
case class Foo(lowercase: String, upperCase: String)
val df = spark.createDataset(List(Foo("abc","DEF"), Foo("ghi","JKL")))
df.write.parquet("/tmp/parquet-test")
val rdf = spark.read.parquet("/tmp/parquet-test")
rdf.printSchema
// root
// |-- lowercase: string (nullable = true)
// |-- upperCase: string (nullable = true)
At this point, subsequent SQL queries will work regardless of the case of the column names:
rdf.createOrReplaceTempView("rdf")
spark.sql("select uppercase from rdf").collect
// Array[org.apache.spark.sql.Row] = Array([DEF], [JKL])
Spark has an option, spark.sql.caseSensitive, to enable/disable case sensitivity (the default is false), but it appears to apply only when writing.
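A quick way to see the effect in the shell (a sketch; the exact AnalysisException message may vary by version):
spark.sql("set spark.sql.caseSensitive=true")
// With case sensitivity on, the lowercased name no longer resolves:
// spark.sql("select uppercase from rdf").collect
// => org.apache.spark.sql.AnalysisException: cannot resolve '`uppercase`' ...
spark.sql("set spark.sql.caseSensitive=false") // restore the default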
Trying to do the same with a stream will result in an exception:
java.lang.IllegalArgumentException: Schema must be specified when creating a streaming
source DataFrame. If some files already exist in the directory, then depending
on the file format you may be able to create a static DataFrame on that directory
with 'spark.read.load(directory)' and infer schema from it.
This leaves you with the following options:
1. Read the schema from the existing data as a static DataFrame and pass it to the stream:
val userSchema = spark.read.parquet("/tmp/parquet-test").schema
val streamDf = spark.readStream.schema(userSchema).parquet("/tmp/parquet-test")
2. Set spark.sql.streaming.schemaInference to true and let Spark infer the schema for the stream itself:
spark.sql("set spark.sql.streaming.schemaInference=true")
val streamDf = spark.readStream.parquet("/tmp/parquet-test")
streamDf.createOrReplaceTempView("stream_rdf")
val query = spark.sql("select uppercase, count(*) from stream_rdf group by uppercase")
  .writeStream
  .format("console")
  .outputMode("complete")
  .start
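In the shell the query runs in the background and prints each batch to the console; in a standalone application you would typically block until it terminates (standard StreamingQuery API):
query.awaitTermination()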
Answer 1 (score: 0)
The problem was in the rowtype column name: the actual column name in the avro-parquet files is "rowType". Changing the schema line to
.add("rowType", StringType)
solved the problem.
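For completeness, a sketch of the question's schema with that one line corrected (when a schema is supplied explicitly, the field name has to match the parquet column name):
import org.apache.spark.sql.types._
val userSchema = new StructType()
  .add("caseId", StringType)
  .add("ts", LongType)
  .add("rowType", StringType) // was "rowtype"
  .add("rowordernumber", IntegerType)
  .add("parentrowordernumber", IntegerType)
  .add("fieldname", StringType)
  .add("valuestr", StringType)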