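Can I do something like fillna, but instead of matching NA values, match with a string-contains or regular-expression test?
For example, my location column contains values such as "United States", "US", and sometimes "New York, USA". I want to map all of these to a single value, "United States". How can I do that?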
Answer 0 (score: 2)
Define a mapping from regular-expression patterns to canonical values:
lookup = [
    ("United States|US|USA", "United States"),
    ("UK|United Kingdom", "United Kingdom"),
]
Then use when:
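from functools import reduce

from pyspark.sql import SparkSession, functions as F

# Reuse the active session, or create one when running as a standalone script.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    ["United States", "US", "New York, USA", "UK", "London, United Kingdom"],
    "string"
)

# Fold the lookup into a nested when/otherwise chain; values that match
# no pattern fall through to "unknown".
country = reduce(
    lambda acc, r: F.when(F.col("value").rlike(r[0]), F.lit(r[1])).otherwise(acc),
    lookup,
    F.lit("unknown"))

df.withColumn("country", country).show()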
# +--------------------+--------------+
# | value| country|
# +--------------------+--------------+
# | United States| United States|
# | US| United States|
# | New York, USA| United States|
# | UK|United Kingdom|
# |London, United Ki...|United Kingdom|
# +--------------------+--------------+
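To make the behavior of that folded chain easier to see, here is a minimal plain-Python sketch of the same logic. It is an illustration, not Spark code: Python's re.search stands in for rlike (which uses Java regular expressions and, like re.search, matches anywhere in the string), and the names classify and strict_lookup are hypothetical helpers introduced only for this example.

```python
import re
from functools import reduce

lookup = [
    ("United States|US|USA", "United States"),
    ("UK|United Kingdom", "United Kingdom"),
]

def classify(value, rules, default="unknown"):
    """Mimic the folded when/otherwise chain for a single string value."""
    def step(acc, rule):
        pattern, label = rule
        # A later rule overrides an earlier one when both match, because the
        # fold wraps each new condition around the previous result.
        return label if re.search(pattern, value) else acc
    return reduce(step, rules, default)

print(classify("New York, USA", lookup))
print(classify("Paris, France", lookup))

# Unanchored alternatives match substrings, so "RUSSIA" (which contains "US")
# is classified as United States by the original patterns.
print(classify("RUSSIA", lookup))

# Hypothetical stricter variant: word boundaries keep "US" from matching
# inside a longer word.
strict_lookup = [
    (r"\b(?:United States|US|USA)\b", "United States"),
    (r"\b(?:UK|United Kingdom)\b", "United Kingdom"),
]
print(classify("RUSSIA", strict_lookup))
print(classify("New York, USA", strict_lookup))
```

If false positives like this matter for your data, the same \b-wrapped patterns should be usable in the lookup passed to rlike, since Java regular expressions also support \b; check this against your own values before relying on it.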
Or use a join:
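# Cross join against the small, broadcast lookup table, then keep the pairs
# where the value matches the pattern.
df.crossJoin(F.broadcast(
    spark.createDataFrame(lookup, ("pattern", "country"))
)).where(F.expr("value rlike pattern")).drop("pattern").show()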
# +--------------------+--------------+
# | value| country|
# +--------------------+--------------+
# | United States| United States|
# | US| United States|
# | New York, USA| United States|
# | UK|United Kingdom|
# |London, United Ki...|United Kingdom|
# +--------------------+--------------+
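The join version is not quite equivalent to the when version: a cross join followed by a filter behaves like an inner join, so a value that matches no pattern disappears from the result instead of becoming "unknown", and a value that matches several patterns appears once per match. The sketch below emulates that in plain Python with a list comprehension. It is only an illustration of the row semantics, with re.search assumed as a stand-in for rlike and sample values made up for the example.

```python
import re

lookup = [
    ("United States|US|USA", "United States"),
    ("UK|United Kingdom", "United Kingdom"),
]

# Made-up sample values: one match, no match, and two matches.
values = ["US", "Paris, France", "US and UK offices"]

# Cross join followed by a filter: keep every (value, country) pair
# whose pattern is found somewhere in the value.
joined = [
    (value, country)
    for value in values
    for pattern, country in lookup
    if re.search(pattern, value)
]

for row in joined:
    print(row)
# ('US', 'United States')
# ('US and UK offices', 'United States')
# ('US and UK offices', 'United Kingdom')
```

"Paris, France" is absent from the output and "US and UK offices" appears twice. If you need exactly one output row per input row, the when version above is the safer choice.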