I have a column in my dataset that I need to split into several columns.
Here is an example of the contextMap_ID1 column, and this is the result I'm looking for.
The code below creates the example (the contextMap_ID1 column) and the expected result (all columns other than the second one). The second column describes the logic I expect.
# Note: the numeric values are quoted so that schema inference
# produces a consistent string type for each column.
dfx = sc.parallelize([
("blah blah blah createdTimeStamp=2020-08-11 15:31:37.458 blah blah blah","contains the word 'TimeStamp' >> do not process","","","")
,("123456789","NUMERIC 9 digit Number >> caseId","123456789","","")
,("caseId: 2345678 personId: 87654321","Multiple key value pairs >> New Column(s) with key as column Name","2345678","87654321","")
,("CRON","AlphaNumeric without ':' >> Do not process","","","")
,("ABC9876543210","Alpha-NUMERIC starting with 'ABC' >> New Column","","","ABC9876543210")
]).toDF(["contextMap_ID1","Description of rules","caseId","personId","ABC"])
dfx.show(truncate=False)
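Spark aside, the rules described in the second column can be sketched in plain Python with `re`. This is only an illustration of the classification logic; the function name `classify` and the tag strings are made up for this sketch:

```python
import re

def classify(value: str) -> str:
    """Tag a contextMap_ID1 value according to the rules described above."""
    if "TimeStamp" in value:
        return "skip"                      # contains the word 'TimeStamp'
    if re.fullmatch(r"[0-9]{9}", value):
        return "caseId"                    # bare 9-digit number
    if re.fullmatch(r"([a-zA-Z0-9]+\s*:\s*[a-zA-Z0-9]+\s*)+", value):
        return "key-value pairs"           # e.g. 'caseId: 2345678 personId: 87654321'
    if re.fullmatch(r"ABC[0-9]+", value):
        return "ABC"                       # alphanumeric starting with 'ABC'
    return "skip"                          # e.g. 'CRON' (no ':', not numeric)

samples = [
    "blah blah blah createdTimeStamp=2020-08-11 15:31:37.458 blah blah blah",
    "123456789",
    "caseId: 2345678 personId: 87654321",
    "CRON",
    "ABC9876543210",
]
print([classify(s) for s in samples])
# → ['skip', 'caseId', 'key-value pairs', 'skip', 'ABC']
```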
Answer #0 (score: 0)
What you can do is define a series of when
conditions based on a list of regular expressions to decide how each row should be processed. Depending on which condition matches, you can then extract a list of key-value pairs (column name and value).
The logic in the code below may not match your requirements exactly (although it produces the output you expect), but since the pipeline is complete, you can easily add or modify conditions.
It could look like this:
from pyspark.sql import functions as F
# So let's first define the conditions and the associated logic
transfo=dict()
# List of pairs
transfo['^([a-zA-Z0-9]+\\s*:\\s*[a-zA-Z0-9]+\\s*)+$'] = F.split(
F.regexp_replace(F.col('contextMap_ID1'), "\\s*:\\s*", ":"), "\\s+")
# 9 digit number
transfo['^[0-9]{9}$'] = F.array(F.concat_ws(':',
F.lit("caseId"),
F.col("contextMap_ID1")))
# Three letters and a number
transfo['^[A-Z]{3}[0-9]+$'] = F.array(F.concat_ws(':',
F.regexp_extract(F.col("contextMap_ID1"), '[A-Z]+', 0),
F.regexp_extract(F.col("contextMap_ID1"), '[0-9]+', 0 )))
# let's combine the conditions into a chain of when/otherwise.
# the initialization of my_fun is meant to avoid discarding rows
# without key value pairs.
my_fun = F.array(F.lit('caseId'))
for x in transfo:
    my_fun = F.when(F.col('contextMap_ID1').rlike(x),
                    transfo[x]).otherwise(my_fun)
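To see what each branch of the when chain produces, here is a pure-Python sketch of the three transfo expressions. Each branch yields a list of 'key:value' strings, mirroring the Spark column expressions above (the function name `to_keyvalues` is illustrative only):

```python
import re

def to_keyvalues(value: str):
    """Mimic the three transfo branches: return a list of 'key:value' strings."""
    if re.search(r"^([a-zA-Z0-9]+\s*:\s*[a-zA-Z0-9]+\s*)+$", value):
        # normalize ' : ' to ':' then split on whitespace, as in the Spark version
        return re.split(r"\s+", re.sub(r"\s*:\s*", ":", value).strip())
    if re.search(r"^[0-9]{9}$", value):
        return ["caseId:" + value]
    if re.search(r"^[A-Z]{3}[0-9]+$", value):
        letters = re.search(r"[A-Z]+", value).group(0)
        digits = re.search(r"[0-9]+", value).group(0)
        return [letters + ":" + digits]
    return ["caseId"]  # fallback, mirroring the my_fun initialization

print(to_keyvalues("caseId: 2345678 personId: 87654321"))
# → ['caseId:2345678', 'personId:87654321']
print(to_keyvalues("ABC9876543210"))
# → ['ABC:9876543210']
```

The patterns are anchored with `^` and `$`, so Python's unanchored `re.search` behaves the same way as Spark's `rlike` here.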
Once the main transformation my_fun
is ready, we can wrap everything up: explode the generated key-value pairs, extract the key and value, and pivot on the key to produce the new columns.
Note that we add an id in case contextMap_ID1
is not unique. If it is unique, you can drop that id.
dfx\
.select('contextMap_ID1', F.monotonically_increasing_id().alias('id'))\
.select('contextMap_ID1', 'id', F.explode(my_fun).alias("keyvalue"))\
.withColumn("key", F.split(F.col('keyvalue'), ":").getItem(0))\
.withColumn("value", F.split(F.col('keyvalue'), ":").getItem(1))\
.groupBy("id", "contextMap_ID1")\
.pivot("key")\
.agg(F.first(F.col('value')))\
.show(truncate=False)
+-----------+----------------------------------------------------------------------+----------+---------+--------+
|id |contextMap_ID1 |ABC |caseId |personId|
+-----------+----------------------------------------------------------------------+----------+---------+--------+
|34359738368|caseId: 2345678 personId: 87654321 |null |2345678 |87654321|
|25769803776|123456789 |null |123456789|null |
|60129542144|ABC9876543210 |9876543210|null |null |
|8589934592 |blah blah blah createdTimeStamp=2020-08-11 15:31:37.458 blah blah blah|null |null |null |
|51539607552|CRON |null |null |null |
+-----------+----------------------------------------------------------------------+----------+---------+--------+
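The explode + pivot step can also be sketched in plain Python to show what it does to the key-value lists. This is a simplified illustration with hard-coded sample ids, not the Spark implementation:

```python
# Each row id maps to the list of 'key:value' strings produced by my_fun.
rows = {
    1: ["caseId:2345678", "personId:87654321"],
    2: ["caseId:123456789"],
    3: ["ABC:9876543210"],
    4: ["caseId"],  # fallback entry: a key without a value
}

# "Pivot": split each 'key:value' string and turn keys into column names.
pivoted = {}
for rid, keyvalues in rows.items():
    record = {}
    for kv in keyvalues:
        key, _, value = kv.partition(":")
        record[key] = value or None  # a bare 'caseId' pivots to a null value
    pivoted[rid] = record

print(pivoted[1])  # → {'caseId': '2345678', 'personId': '87654321'}
print(pivoted[4])  # → {'caseId': None}
```

This is why the fallback `F.array(F.lit('caseId'))` keeps unmatched rows in the output: they still produce one key, which pivots to a null value instead of the row being discarded.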