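I am setting up a Spark batch job that cleans up some fields. How can I set the columns that hold the problematic values to None in every row? (I already have a DataFrame that contains only the rows to change.)
I am far from a Spark expert. I searched a lot before asking, but I am still lost and have not found a simple enough answer.
There are about 50 columns, and I cannot hard-code a column index to access them, because it may change in future batches.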
Input DataFrame:
id       TARGET 1      TARGET 2      TARGET 3      Col6  ...
someid1  Some(String)  Some(String)  Some(String)  val1  ...
someid2  Some(String)  Some(String)  None          val4  ...
someid5  Some(String)  Some(String)  Some(String)  val3  ...
someid6  Some(String)  Some(String)  Some(String)  val7  ...
Expected output DataFrame:
id       TARGET 1      TARGET 2      TARGET 3      Col6  ...
someid1  None          None          None          val1  ...
someid2  None          None          None          val4  ...
someid5  None          None          None          val3  ...
someid6  None          None          None          val7  ...
Answer 0 (score: 0)
AFAIK, Spark does not accept None values. One possible solution is to replace the values with a null literal cast to String:
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType
ds
  .withColumn("target1", lit(null).cast(StringType))
  .withColumn("target2", lit(null).cast(StringType))
It produces the following output:
+--------------------+-------+-------+-----+
| id|target1|target2| col6|
+--------------------+-------+-------+-----+
| 4201735573065099460| null| null|疦薠紀趣餅|
|-6432819446886055080| null| null|┵િ塇駢뱪|
|-7700306868339925800| null| null|鵎썢鳝踽嬌|
|-4913818084582557950| null| null|ꢵ痩찢쑖|
| 6731176796531697018| null| null|少⽬ᩢゖ謹|
+--------------------+-------+-------+-----+
only showing top 5 rows
root
|-- id: long (nullable = false)
|-- target1: string (nullable = true)
|-- target2: string (nullable = true)
|-- col6: string (nullable = true)
That is also the result you get when you set the values to None in a Dataset. The map call in the following code returns the same null values for target1 and target2:
// Assumes a SparkSession named spark is in scope, as in spark-shell.
import spark.implicits._
case class TestData(id: Long, target1: Option[String], target2: Option[String], col6: String)
val res = Seq(
  TestData(1, Some("a"), Some("b"), "c"),
  TestData(2, Some("a"), Some("b"), "c"),
  TestData(3, Some("a"), Some("b"), "c"),
  TestData(4, Some("a"), Some("b"), "c")
).toDS()
res.show(5)                                             // original values
res.map(_.copy(target1 = None, target2 = None)).show(5) // target1 and target2 become null
res.printSchema()
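Since the question mentions about 50 columns whose positions may change, the cast-to-null approach can also be applied by column name instead of by index. The following is only a minimal sketch under stated assumptions: the object NullOutColumns, the helper nullOut, the sample rows, and the rule that the columns to clear are those whose names start with "target" are all hypothetical and not taken from the original post.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

object NullOutColumns {
  // Replace each listed column with a null cast to that column's existing
  // data type, so the schema and the column order stay the same.
  def nullOut(df: DataFrame, targets: Seq[String]): DataFrame =
    targets.foldLeft(df) { (acc, name) =>
      acc.withColumn(name, lit(null).cast(acc.schema(name).dataType))
    }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real batch job would reuse its own.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("null-out-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows standing in for the rows that need cleaning.
    val df = Seq(
      ("someid1", "a", "b", "val1"),
      ("someid2", "a", "b", "val4")
    ).toDF("id", "target1", "target2", "col6")

    // Assumed selection rule: pick the columns by name, never by index.
    val targets = df.columns.filter(_.startsWith("target")).toSeq

    nullOut(df, targets).show()
    spark.stop()
  }
}
```

Casting to each column's current data type, rather than to a fixed StringType, is a design choice in this sketch so that non-string target columns keep their declared type.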