Dropping a column from JSON when loading with spark.read.json()

Asked: 2017-07-22 20:48:44

Tags: hadoop apache-spark-sql

I am stuck in a very strange situation. I have a file containing, for example, these three JSON records:

{"uploadTimeStamp":"1500618037189","ID":"123ID","data":[{"Data":{"unit":"rpm","value":"0"},"EventID":"E1","Timestamp":1500618037189,"pii":{}},{"Data":{"heading":"N","loc1":"false","loc2":"13.022425","loc3":"77.760587","loc4":"false","speed":"10"},"EventID":"E2","Timestamp":1500618037189,"pii":{}},{"Data":{"x":"1.1","y":"1.2","z":"2.2"},"EventID":"E3","Timestamp":1500618037189,"pii":{}},{"EventID":"E4","Data":{"value":"50","unit":"percentage"},"Timestamp":1500618037189},{"Data":{"unit":"kmph","value":"60"},"EventID":"E5","Timestamp":1500618037189,"pii":{}}]}
{"uploadTimeStamp":"1500618045735","ID":"123ID","data":[{"Data":{"unit":"rpm","value":"0"},"EventID":"E1","Timestamp":1500618045735,"pii":{}},{"Data":{"heading":"N","loc1":"false","loc2":"13.022425","loc3":"77.760587","loc4":"false","speed":"10"},"EventID":"E2","Timestamp":1500618045735,"pii":{}},{"Data":{"x":"1.1","y":"1.2","z":"2.2"},"EventID":"E3","Timestamp":1500618045735,"pii":{}},{"EventID":"E4","Data":{"value":"50","unit":"percentage"},"Timestamp":1500618045735},{"Data":{"unit":"kmph","value":"60"},"EventID":"E5","Timestamp":1500618045735,"pii":{}}]}
{"REGULAR_DUMMY":"REGULAR_DUMMY", "ID":"123ID", "uploadTimeStamp":1500546893837}

I am loading this JSON with Spark SQL (spark.read.json). I then create a temporary view with df.createOrReplaceTempView("TEST") and run spark.sql("select count(*) from TEST").

I want to count all records whose ID is 123ID, but I want to ignore the "REGULAR_DUMMY" row. That means, for the data above, count(*) should be 2, not 3.
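For reference, here is a minimal, self-contained sketch of what I am running (the Spark session setup and app name are placeholders; the path follows the same style as below):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CountTest").getOrCreate()

// read the newline-delimited JSON files; Spark infers one merged schema across all records
val df = spark.read.json("hdfs://10.2.3.4/test/path/*")

// register a temp view and count through SQL
df.createOrReplaceTempView("TEST")
spark.sql("select count(*) from TEST").show()   // prints 3 here, while I expect 2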

I have tried the following ways of dropping the "REGULAR_DUMMY" column before taking the count:

1. val df = spark.read.json("hdfs://10.2.3.4/test/path/*").drop("REGULAR_DUMMY") - it shows the schema as uploadTimeStamp, ID and data, which looks perfectly fine. But the count still comes out as 3.

2. df.drop("REGULAR_DUMMY").createOrReplaceTempView("TEST") - this also shows the schema as uploadTimeStamp, ID and data, but the count is again 3.

3. spark.sql("select count(*) from TEST").drop("REGULAR_DUMMY") - the count is 3 again.

If I do:

hadoop fs -cat /test/path/* | grep -i "123ID" | grep -v "REGULAR_DUMMY" | wc -l, then the count is 2.

And:

hadoop fs -cat /test/path/* | grep -i "123ID" | wc -l, the count is 3.

So, what am I missing?

1 Answer:

Answer 0 (score: 1)

.drop removes the whole column, whereas count counts rows. Since you have not removed or filtered the rows that carry the value "REGULAR_DUMMY" in the REGULAR_DUMMY column, a count of 3 is correct.

All you need to do is filter out the rows that have the value "REGULAR_DUMMY" in the REGULAR_DUMMY column and then take the count, as in

import org.apache.spark.sql.functions._
// keep only rows where REGULAR_DUMMY is not equal to the literal "REGULAR_DUMMY"
df.filter(col("REGULAR_DUMMY") =!= "REGULAR_DUMMY").select(count("*"))

This, however, returns 0, because in the remaining rows the REGULAR_DUMMY column is null; a null compared with =!= never evaluates to true, so every row ends up being filtered out.
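As a side note (a sketch, not part of the original answer): for this particular data you can express the same intent without filling nulls, by keeping only the rows where REGULAR_DUMMY is null, since only the dummy record has a value in that column:

// keep rows where the REGULAR_DUMMY column has no value, i.e. the non-dummy records
df.filter(col("REGULAR_DUMMY").isNull).select(count("*")).show()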

The workaround is to replace the null values with a temporary placeholder, apply the filter, and finally take the count:

// fill nulls with a placeholder so the inequality filter keeps the non-dummy rows
df.na.fill("temp").filter(col("REGULAR_DUMMY") =!= "REGULAR_DUMMY").select(count("*"))

which should show the correct result:

+--------+
|count(1)|
+--------+
|2       |
+--------+

You can also use a where filter:

df.na.fill("temp").where(col("REGULAR_DUMMY") =!= "REGULAR_DUMMY").select(count("*"))
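Since the question goes through a temp view, the same filter can also be written in SQL; a sketch, not from the original answer, assuming the view is registered as TEST as in the question:

df.createOrReplaceTempView("TEST")
// only the dummy record has a non-null REGULAR_DUMMY, so this counts the other rows
spark.sql("select count(*) from TEST where REGULAR_DUMMY is null").show()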