Filter one of the df's columns, which contains JSON

Time: 2019-07-02 18:18:31

Tags: scala apache-spark

I have a DF like the following:

|               value              |offset      (these two are the columns)
|{"Name":"myname","valid":"true"}  |  Guru
|{"Name":"myname1","valid":"false"}|  Guru

I want to derive two DFs from it, based on whether the value column contains true or false:

|               value              |offset
|{"Name":"myname","valid":"true"}  |  Guru

|               value              |offset
|{"Name":"myname1","valid":"false"}|  Guru

1 Answer:

Answer 0 (score: 0)

get_json_object() is meant for working with fields that contain a JSON string. See https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$@get_json_object(e:org.apache.spark.sql.Column,path:String):org.apache.spark.sql.Column

scala> val in = """value offset  partition       sourceSystem    sourceName      datePartition
     | {"Name":"myname","valid":"true"}  Guru    1       sda     sajka   ajsa
     | {"Name":"myname1","valid":"false"}        Guru    1       sda     sajka   ajsa"""
in: String =
value   offset  partition   sourceSystem    sourceName  datePartition
{"Name":"myname","valid":"true"}    Guru    1   sda sajka   ajsa
{"Name":"myname1","valid":"false"}  Guru    1   sda sajka   ajsa

scala> val df = spark.read.option("header", true).option("sep", "\t").csv(in.split("\n").toSeq.toDS)
df: org.apache.spark.sql.DataFrame = [value: string, offset: string ... 4 more fields]

scala> df.where(get_json_object('value, "$.valid") === "true").show
+--------------------+------+---------+------------+----------+-------------+
|               value|offset|partition|sourceSystem|sourceName|datePartition|
+--------------------+------+---------+------------+----------+-------------+
|{"Name":"myname",...|  Guru|        1|         sda|     sajka|         ajsa|
+--------------------+------+---------+------------+----------+-------------+
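The transcript above only keeps the "true" rows. A minimal sketch (not part of the original answer) of how the same predicate can produce the two DataFrames the question asks for, assuming the `df` built above and that the names validDF/invalidDF are just illustrative:

import org.apache.spark.sql.functions.{col, get_json_object}

// Extract the "valid" field from the JSON string once, then split on it.
val validFlag = get_json_object(col("value"), "$.valid")

val validDF   = df.where(validFlag === "true")   // rows where valid == "true"
val invalidDF = df.where(validFlag === "false")  // rows where valid == "false"

validDF.show(false)
invalidDF.show(false)

In the spark-shell the needed implicits and functions are already imported, so the same filter can also be written as df.where(get_json_object('value, "$.valid") === "false") for the second DataFrame.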