Spark: filter or where returns extra results when applied after DataFrame.limit()

Asked: 2016-11-13 21:03:07

Tags: scala apache-spark spark-dataframe

I have a DataFrame (df1) with 1000+ rows and the following schema:

+--------+-------------+-------------------+
|      id|       pro_id|           datetime|
+--------+-------------+-------------------+
|11304569|      8195360|2015-01-23 15:21:51|
|11334963|      8060212|2015-01-28 22:49:17|
+--------+-------------+-------------------+ 

After applying:

val df2 = df1.limit(10)                 // take 10 rows; no ordering is specified
println(" no of df2 : " + df2.count()) // first action
df2.show()                              // second action

this gives the result:

 no of df2 : 10
+--------+-------------+-------------------+
|      id|       pro_id|           datetime|
+--------+-------------+-------------------+
|11304569|      8195360|2015-01-23 15:21:51|
|11334963|      8060212|2015-01-28 22:49:17|
|11334963|      8060212|2015-01-28 22:49:17|
|11334963|      8060212|2015-01-28 23:20:43|
|11304569|      8143638|2015-02-03 14:34:48|
|11336154|      8060212|2015-02-03 19:25:24|
|11304569|      8173052|2015-02-05 08:15:12|
|11398902|      8173052|2015-02-05 08:18:50|
|11349129|      8097653|2015-02-05 08:29:33|
|11349129|      8027845|2015-02-05 08:29:33|
+--------+-------------+-------------------+
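
(Note: count() and show() are separate Spark actions, and without an explicit ordering, limit(10) is not guaranteed to select the same 10 rows each time the plan is executed, so each action may see a different sample. A minimal sketch, assuming the same df1 as above, that pins one fixed set of rows by caching:)

// Sketch: cache() marks df2 for persistence; the first action materializes
// the selected 10 rows, and later actions reuse the same cached rows.
val df2 = df1.limit(10).cache()
df2.count()   // materializes and caches the selected rows
df2.show()    // reuses the cached rows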

Then, after applying the filter function:

val v = df2.filter($"datetime" >= "2015-02-03 00:00:00") // filter on the limited frame
println(" no of v : " + v.count())                       // another action
v.show()                                                 // yet another action

This should give me only the last 6 rows, but instead it gives:

 no of v : 10
+--------+-------------+-------------------+
|      id|       pro_id|           datetime|
+--------+-------------+-------------------+
|11304569|      8143638|2015-02-03 14:34:48|
|11336154|      8060212|2015-02-03 19:25:24|
|11304569|      8173052|2015-02-05 08:15:12|
|11398902|      8173052|2015-02-05 08:18:50|
|11349129|      8097653|2015-02-05 08:29:33|
|11349129|      8027845|2015-02-05 08:29:33|
|11349129|      8105806|2015-02-05 08:29:33|
|11349187|      8197725|2015-02-05 09:00:32|
|11349188|      8134473|2015-02-05 08:01:50|
|11349187|      8132574|2015-02-05 09:09:07|
+--------+-------------+-------------------+

How am I getting 4 extra rows from the original df1 when I am not even joining with df1?

Does the limit function work in some other way?
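
For reference, a hedged sketch of how the selection could be made deterministic instead of relying on caching, assuming spark.implicits._ is in scope as in the snippets above: impose an explicit ordering before limit, so that "the first 10 rows" is well-defined on every execution.

// Sketch: with an orderBy, limit(10) always refers to the same 10 rows,
// so the subsequent filter yields the expected 6 matches every time.
val df2 = df1.orderBy($"datetime").limit(10)
val v = df2.filter($"datetime" >= "2015-02-03 00:00:00")
println(" no of v : " + v.count())   // expected: 6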

0 Answers:

There are no answers yet.