Set all values of multiple columns to None

Time: 2019-10-28 11:36:07

Tags: scala dataframe apache-spark

I am setting up a Spark batch job meant to filter out some fields that need cleaning. How can I set the problematic columns to None across all rows? (I already have a dataframe containing only the rows to change.)

I am far from a Spark expert, and I searched a lot before asking, but I am still at a loss and have not found a simple enough answer.

There are about 50 columns, and I cannot hardcode the column indexes to access them, since they may change in future batches.

Example input dataframe (the target columns contain data):

id        TARGET 1       TARGET 2       TARGET 3     Col6     ...
someid1   Some(String)   Some(String)   Some(String) val1     ...
someid2   Some(String)   Some(String)   None         val4     ...
someid5   Some(String)   Some(String)   Some(String) val3     ...
someid6   Some(String)   Some(String)   Some(String) val7     ... 

Expected output dataframe (all target columns set to None):

id        TARGET 1       TARGET 2       TARGET 3     Col6     ...
someid1   None           None           None         val1     ...
someid2   None           None           None         val4     ...
someid5   None           None           None         val3     ...
someid6   None           None           None         val7     ...

1 Answer:

Answer 0 (score: 0)

AFAIK, Spark does not accept None values in a DataFrame. One possible solution is to replace them with a null value cast to String:

ds
  .withColumn("target1", lit(null).cast(StringType))
  .withColumn("target2", lit(null).cast(StringType))

It produces the following output:

+--------------------+-------+-------+-----+
|                  id|target1|target2| col6|
+--------------------+-------+-------+-----+
| 4201735573065099460|   null|   null|疦薠紀趣餅|
|-6432819446886055080|   null|   null| ┵િ塇駢뱪|
|-7700306868339925800|   null|   null|鵎썢鳝踽嬌|
|-4913818084582557950|   null|   null|  ꢵ痩찢쑖|
| 6731176796531697018|   null|   null|少⽬ᩢゖ謹|
+--------------------+-------+-------+-----+
only showing top 5 rows

root
 |-- id: long (nullable = false)
 |-- target1: string (nullable = true)
 |-- target2: string (nullable = true)
 |-- col6: string (nullable = true)

This is also the result you get when setting the values to None in a typed Dataset:

// A typed Dataset can model nullable fields as Option; None becomes null when shown.
case class TestData(id: Long, target1: Option[String], target2: Option[String], col6: String)

import spark.implicits._

val res = Seq(
  TestData(1, Some("a"), Some("b"), "c"),
  TestData(2, Some("a"), Some("b"), "c"),
  TestData(3, Some("a"), Some("b"), "c"),
  TestData(4, Some("a"), Some("b"), "c")
).toDS()

res.show(5)
// Copying each row with the Option fields set to None clears the target columns:
res.map(_.copy(target1 = None, target2 = None)).show(5)
res.printSchema()
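Since the question mentions roughly 50 target columns, one withColumn call per column does not scale. Assuming the names of the columns to clear are available in a sequence (targetCols below is hypothetical), the same lit(null) replacement can be folded over all of them, a sketch:

```scala
// Sketch: null out an arbitrary list of columns without hardcoding indexes.
// `targetCols` is an assumed list of the column names to clear.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

def nullifyColumns(df: DataFrame, targetCols: Seq[String]): DataFrame =
  targetCols.foldLeft(df) { (acc, colName) =>
    // withColumn replaces an existing column of the same name with a null String
    acc.withColumn(colName, lit(null).cast(StringType))
  }
```

Because withColumn replaces a column in place when the name already exists, every non-target column is left untouched, and the list of names can change between batches without touching the code.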