I've run into a very strange situation. I have a file containing, for example, these three JSON records:
{"uploadTimeStamp":"1500618037189","ID":"123ID","data":[{"Data":{"unit":"rpm","value":"0"},"EventID":"E1","Timestamp":1500618037189,"pii":{}},{"Data":{"heading":"N","loc1":"false","loc2":"13.022425","loc3":"77.760587","loc4":"false","speed":"10"},"EventID":"E2","Timestamp":1500618037189,"pii":{}},{"Data":{"x":"1.1","y":"1.2","z":"2.2"},"EventID":"E3","Timestamp":1500618037189,"pii":{}},{"EventID":"E4","Data":{"value":"50","unit":"percentage"},"Timestamp":1500618037189},{"Data":{"unit":"kmph","value":"60"},"EventID":"E5","Timestamp":1500618037189,"pii":{}}]}
{"uploadTimeStamp":"1500618045735","ID":"123ID","data":[{"Data":{"unit":"rpm","value":"0"},"EventID":"E1","Timestamp":1500618045735,"pii":{}},{"Data":{"heading":"N","loc1":"false","loc2":"13.022425","loc3":"77.760587","loc4":"false","speed":"10"},"EventID":"E2","Timestamp":1500618045735,"pii":{}},{"Data":{"x":"1.1","y":"1.2","z":"2.2"},"EventID":"E3","Timestamp":1500618045735,"pii":{}},{"EventID":"E4","Data":{"value":"50","unit":"percentage"},"Timestamp":1500618045735},{"Data":{"unit":"kmph","value":"60"},"EventID":"E5","Timestamp":1500618045735,"pii":{}}]}
{"REGULAR_DUMMY":"REGULAR_DUMMY", "ID":"123ID", "uploadTimeStamp":1500546893837}
I am loading this JSON with spark-sql (spark.read.json). Then I call df.createOrReplaceTempView("TEST") followed by spark.sql("select count(*) from TEST"). I want to count all records whose ID is 123ID, but ignore the "REGULAR_DUMMY" rows. That means, given the data above, count(*) should be 2, not 3.
I have tried the following to drop the "REGULAR_DUMMY" column and then take the count:

1 - val df = spark.read.json("hdfs://10.2.3.4/test/path/*").drop("REGULAR_DUMMY")
    This shows the schema as uploadTimeStamp, ID and data, which looks exactly right, but the count comes out as 3.
2 - df.drop("REGULAR_DUMMY").createOrReplaceTempView("TEST")
    This also shows the schema as uploadTimeStamp, ID and data, but the count is again 3.
3 - spark.sql("select count(*) from TEST").drop("REGULAR_DUMMY")
    Again the count is 3.
However, if I do:

hadoop fs -cat /test/path/* | grep -i "123ID" | grep -v "REGULAR_DUMMY" | wc -l

the count is 2, and with:

hadoop fs -cat /test/path/* | grep -i "123ID" | wc -l

the count is 3. So, what am I missing?
Answer 0 (score: 1)
.drop removes an entire column, while count counts rows. Since you have not removed or filtered the rows that carry the value "REGULAR_DUMMY" in the REGULAR_DUMMY column, a count of 3 is correct.
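To make the distinction concrete, here is a minimal sketch (reusing the df from the question) showing that drop changes the schema but never the number of rows:

// a sketch only: drop removes the column from the schema,
// but every row survives, so the count is unchanged
val dropped = df.drop("REGULAR_DUMMY")
dropped.printSchema()     // uploadTimeStamp, ID, data -- no REGULAR_DUMMY
println(dropped.count())  // still 3: no rows were removed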
What you need to do instead is filter out the rows that have the value "REGULAR_DUMMY" in the REGULAR_DUMMY column, and then take the count, as in:
import org.apache.spark.sql.functions._

// keep only the rows whose REGULAR_DUMMY column is not the dummy marker, then count
df.filter(col("REGULAR_DUMMY") =!= "REGULAR_DUMMY").select(count("*"))
This, however, returns 0, because in all of the remaining rows the REGULAR_DUMMY column is null: none of those rows has a value in the REGULAR_DUMMY column, so the comparison evaluates to null and they all get filtered out as well.
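The underlying cause is SQL's three-valued logic: comparing null with =!= yields null rather than true, and filter keeps only rows where the predicate is true. A tiny sketch (toy data, not from the question) makes this visible:

import spark.implicits._
import org.apache.spark.sql.functions.col

// one dummy value plus two nulls, mirroring the layout above
val toy = Seq(Some("REGULAR_DUMMY"), None, None).toDF("REGULAR_DUMMY")
toy.filter(col("REGULAR_DUMMY") =!= "REGULAR_DUMMY").count()  // 0, the null rows are dropped too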
The workaround is to replace the null values with a temporary placeholder, then filter, and finally count:
df.na.fill("temp").filter(col("REGULAR_DUMMY") =!= "REGULAR_DUMMY").select(count("*"))
This should show the correct result:
+--------+
|count(1)|
+--------+
|       2|
+--------+
You can also use where in place of filter:
df.na.fill("temp").where(col("REGULAR_DUMMY") =!= "REGULAR_DUMMY").select(count("*"))