I have a PySpark DataFrame.
Example:
text <String>                | name <String> | original_name <String>
-----------------------------|---------------|-----------------------
HELLOWORLD2019THISISGOOGLE   | WORLD2019     | WORLD_2019
NATUREISVERYGOODFOROURHEALTH | null          | null
THESUNCONTAINVITAMIND        | VITAMIND      | VITAMIN_D
BECARETOOURHEALTHISVITAMIND  | OURHEALTH     | OUR_/HEALTH
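I want to go through the text column and check whether each text contains any of the values from the name column. If it does, I create a new_column that holds the original_name values corresponding to the name values found in text. Note that the name and original_name columns are sometimes null.
Example:
In row 4 of the sample DataFrame, text contains two values from the name column: [OURHEALTH, VITAMIND]. I should take their original_name values and store them in new_column.
In row 2, text contains OURHEALTH from the name column, so I should store the matching original_name value in new_column ==> [OUR_/HEALTH]
I hope my explanation is clear.
Expected result: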
text <String>                | name <String> | original_name <String> | new_column <Array>
-----------------------------|---------------|------------------------|-------------------------
HELLOWORLD2019THISISGOOGLE   | WORLD2019     | WORLD_2019             | [WORLD_2019]
NATUREISVERYGOODFOROURHEALTH | null          | null                   | [OUR_/HEALTH]
THESUNCONTAINVITAMIND        | VITAMIND      | VITAMIN_D              | [VITAMIN_D]
BECARETOOURHEALTHISVITAMIND  | OURHEALTH     | OUR_/HEALTH            | [OUR_/HEALTH, VITAMIN_D]
Can anyone help me? Thanks.
Answer 0 (score: 2)
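A simple solution is to use a join between the original DataFrame and a derived DataFrame that holds only the search words from the name column (together with their original_name). Since the join condition can be satisfied by several rows for the same text, we have to group by the original columns after the join.
Here is a detailed example with your input: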
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, collect_list, struct

spark = SparkSession.builder.getOrCreate()

data = [
    ("HELLOWORLD2019THISISGOOGLE", "WORLD2019", "WORLD_2019"),
    ("NATUREISVERYGOODFOROURHEALTH", None, None),
    ("THESUNCONTAINVITAMIND", "VITAMIND", "VITAMIN_D"),
    ("BECARETOOURHEALTHISVITAMIND", "OURHEALTH", "OUR_ / HEALTH")
]
df = spark.createDataFrame(data, ["text", "name", "original_name"])

# Build a lookup DataFrame with the search words.
# original_name is what we want in the final list, so select it as well.
search_df = df.select(struct(col("name"), col("original_name")).alias("search_match"))

# Left join: keep every row of df and pair it with each search word its text contains.
df_join = df.join(search_df, df.text.contains(search_df["search_match.name"]), "left")

# Group by the original columns and collect the matched original_name values in a list.
df_join.groupBy("text", "name", "original_name") \
    .agg(collect_list(col("search_match.original_name")).alias("new_column")) \
    .show(truncate=False)
Output (the order of the rows, and of the elements inside new_column, is not guaranteed):
+----------------------------+---------+-------------+--------------------------+
|text |name |original_name|new_column |
+----------------------------+---------+-------------+--------------------------+
|HELLOWORLD2019THISISGOOGLE |WORLD2019|WORLD_2019 |[WORLD_2019] |
|THESUNCONTAINVITAMIND |VITAMIND |VITAMIN_D |[VITAMIN_D] |
|NATUREISVERYGOODFOROURHEALTH|null |null |[OUR_ / HEALTH] |
|BECARETOOURHEALTHISVITAMIND |OURHEALTH|OUR_ / HEALTH|[VITAMIN_D, OUR_ / HEALTH]|
+----------------------------+---------+-------------+--------------------------+
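
To sanity-check what the join followed by groupBy computes, the same logic can be sketched in plain Python on a small sample, without Spark. This is only an illustration: the helper name collect_matches is hypothetical, the input rows are assumed to be distinct, and the `in` substring test stands in for the contains condition used in the join.

```python
from typing import List, Optional, Tuple

# One input row: (text, name, original_name); name and original_name may be None.
Row = Tuple[str, Optional[str], Optional[str]]


def collect_matches(rows: List[Row]) -> List[Tuple[str, List[str]]]:
    """For each row, list the original_name of every name found inside its text."""
    # The lookup pairs play the role of search_df. Pairs with a null name can
    # never match, and null original_name values would be skipped by collect_list.
    lookup = [
        (name, original)
        for _, name, original in rows
        if name is not None and original is not None
    ]
    result = []
    for text, _, _ in rows:
        # Substring test: stands in for df.text.contains(...) in the join condition.
        matches = [original for name, original in lookup if name in text]
        result.append((text, matches))
    return result


rows = [
    ("HELLOWORLD2019THISISGOOGLE", "WORLD2019", "WORLD_2019"),
    ("NATUREISVERYGOODFOROURHEALTH", None, None),
    ("THESUNCONTAINVITAMIND", "VITAMIND", "VITAMIN_D"),
    ("BECARETOOURHEALTHISVITAMIND", "OURHEALTH", "OUR_ / HEALTH"),
]

for text, matches in collect_matches(rows):
    print(text, matches)
```

Unlike collect_list in Spark, this sketch returns the matches in a fixed order (the order of the rows that supply the search words), which makes it convenient for comparing against the Spark result on small inputs.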