PySpark: fillna, but for string / regex matching

Time: 2018-04-14 12:50:23

Tags: apache-spark pyspark

Can I do something like fillna, except that instead of matching NA values I match on a string-contains or regular-expression condition?

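For example, my location column contains values such as United States or US, and sometimes New York, USA. I want to map values like these to a single value, United States. How can I do that?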

1 answer:

Answer 0: (score: 2)

Define a mapping from regex patterns to canonical values:

lookup = [
    ("United States|US|USA", "United States"),
    ("UK|United Kingdom", "United Kingdom")
]
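
Because rlike performs an unanchored search, a short alternative such as US or UK also matches inside longer upper-case strings. The following is a minimal sketch for checking the patterns locally before running them in Spark. It assumes that Python's re.search behaves like rlike for these simple patterns; the sample strings, the matches helper, and the word-boundary variants are illustrative and not part of the original answer.

```python
import re

lookup = [
    ("United States|US|USA", "United States"),
    ("UK|United Kingdom", "United Kingdom"),
]

# Hypothetical sample values, chosen to expose false positives.
samples = ["New York, USA", "US", "AUSTRIA", "UKRAINE"]


def matches(pattern, values):
    """Return the values in which the pattern is found anywhere (like rlike)."""
    return [v for v in values if re.search(pattern, v)]


# Patterns exactly as defined in the lookup above.
loose_us = matches(lookup[0][0], samples)
loose_uk = matches(lookup[1][0], samples)

# Assumed tightening: wrap the alternatives in word boundaries.
strict_us = matches(r"\b(?:United States|US|USA)\b", samples)
strict_uk = matches(r"\b(?:UK|United Kingdom)\b", samples)

print(loose_us)   # ['New York, USA', 'US', 'AUSTRIA']
print(loose_uk)   # ['UKRAINE']
print(strict_us)  # ['New York, USA', 'US']
print(strict_uk)  # []
```

The \b word-boundary assertion is also supported by the Java regex engine that Spark uses for rlike, so the stricter patterns could be placed in lookup if such false positives matter for the data at hand.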

and use when:

from pyspark.sql import functions as F
from functools import reduce

df = spark.createDataFrame(
    ["United States", "US", "New York, USA", "UK", "London, United Kingdom"],
    "string"
)

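# Fold the lookup into one nested when/otherwise expression.
# Values that match no pattern fall through to "unknown".
country = reduce(
    lambda acc, r: F.when(F.col("value").rlike(r[0]), F.lit(r[1])).otherwise(acc),
    lookup,
    F.lit("unknown"))
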

df.withColumn("country", country).show()
# +--------------------+--------------+
# |               value|       country|
# +--------------------+--------------+
# |       United States| United States|
# |                  US| United States|
# |       New York, USA| United States|
# |                  UK|United Kingdom|
# |London, United Ki...|United Kingdom|
# +--------------------+--------------+
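
The reduce call wraps each when around the previously built expression, so the last entry in lookup is tested first and wins when several patterns match the same value. A plain-Python sketch of the same fold (no Spark involved; classify is a hypothetical helper name, and null handling is not modeled) makes that order visible:

```python
import re
from functools import reduce

lookup = [
    ("United States|US|USA", "United States"),
    ("UK|United Kingdom", "United Kingdom"),
]


def classify(value, rules, default="unknown"):
    """Mimic the nested when/otherwise chain for a single string value."""
    # Each step either replaces the accumulator with the rule's label
    # or keeps it, so a later matching rule overrides an earlier one.
    return reduce(
        lambda acc, r: r[1] if re.search(r[0], value) else acc,
        rules,
        default,
    )


print(classify("New York, USA", lookup))  # United States
print(classify("Paris, France", lookup))  # unknown
print(classify("US and UK", lookup))      # United Kingdom
```

If a different priority is needed, reordering lookup so that the preferred pattern comes last is one option.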

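Or use a join. Unlike the when version, rows that match no pattern are dropped rather than labeled unknown: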

df.crossJoin(F.broadcast(
    spark.createDataFrame(lookup, ("pattern", "country"))
)).where(F.expr("value rlike pattern")).drop("pattern").show()
# +--------------------+--------------+
# |               value|       country|
# +--------------------+--------------+
# |       United States| United States|
# |                  US| United States|
# |       New York, USA| United States|
# |                  UK|United Kingdom|
# |London, United Ki...|United Kingdom|
# +--------------------+--------------+