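How to replace a string in a PySpark DataFrame column with a value from another column of the same DataFrame

Time: 2020-01-09 20:36:28

Tags: pyspark apache-spark-sql

Experts, this is a minor problem, but I can't get it to work correctly.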

+--------------+----------------------------------------------------------+-------------------+
|table         |query                                                     |date               |
+--------------+----------------------------------------------------------+-------------------+
|AGENT         |select * from table where DW_EFFECTIVE_DATE_PARTITION ='X'|2019-12-24 00:00:00|
+--------------+----------------------------------------------------------+-------------------+

All I want to do in this DataFrame is change the query column to:

select * from table where DW_EFFECTIVE_DATE_PARTITION ='2019-12-24 00:00:00'

I tried:

>>> dfX.withColumn('query',regexp_replace('query',"'X'","'" + dfX['d'] + "'")).show()
Traceback (most recent call last):
TypeError: 'Column' object is not callable

Desired output:

+--------------+----------------------------------------------------------------------------+-------------------+
|table         |query                                                                       |date               |
+--------------+----------------------------------------------------------------------------+-------------------+
|AGENT         |select * from table where DW_EFFECTIVE_DATE_PARTITION ='2019-12-24 00:00:00'|2019-12-24 00:00:00|
+--------------+----------------------------------------------------------------------------+-------------------+

3 answers:

Answer 0 (score: 2)

You can use selectExpr instead of withColumn:

>>> df.selectExpr("table","regexp_replace(query, 'X', date) as query", "date").show(truncate=False)
+-----+----------------------------------------------------------------------------+-------------------+
|table|query                                                                       |date               |
+-----+----------------------------------------------------------------------------+-------------------+
|AGENT|select * from table where DW_EFFECTIVE_DATE_PARTITION ='2019-12-24 00:00:00'|2019-12-24 00:00:00|
+-----+----------------------------------------------------------------------------+-------------------+
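One thing to watch with this pattern: the SQL literal `'X'` is just the one-character regex `X`, so it replaces every capital X in `query`, not only the quoted placeholder. It works on the sample row because that row contains a single X. The plain-Python sketch below uses the standard `re` module rather than Spark (Spark uses Java regular expressions, which should behave the same way for these literal patterns) and a made-up query with a second X, purely to illustrate the difference.

```python
import re

# Hypothetical query with a capital X outside the placeholder (table name TAX).
query = "select * from TAX where DW_EFFECTIVE_DATE_PARTITION ='X'"
date = "2019-12-24 00:00:00"

# Pattern X: matches every capital X, including the one in the table name.
loose = re.sub("X", date, query)

# Pattern 'X': matches only the quoted placeholder; the quotes are added back.
strict = re.sub("'X'", "'" + date + "'", query)

print(loose)
print(strict)
```

If the real queries can contain other capital X characters, including the quotes in the pattern is probably the safer choice.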

Answer 1 (score: 1)

Use regexp_replace together with expr, which lets you replace the string with the value of another column:

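from pyspark.sql.functions import expr
# SQL expression: replace the quoted placeholder 'X' with the quoted value of the date column
replace_expr = """regexp_replace(query,"'X'",concat("'", date, "'"))"""
df.withColumn("query", expr(replace_expr)).show(truncate=False)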

Gives:

+-----+----------------------------------------------------------------------------+-------------------+
|table|query                                                                       |date               |
+-----+----------------------------------------------------------------------------+-------------------+
|AGENT|select * from table where DW_EFFECTIVE_DATE_PARTITION ='2019-12-24 00:00:00'|2019-12-24 00:00:00|
+-----+----------------------------------------------------------------------------+-------------------+
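To try this end to end, here is a minimal sketch that rebuilds the one-row DataFrame from the question and applies the same expression. The names `build_replace_expr` and `demo` are invented for this illustration, and the `date` column is assumed to be stored as a string. The pyspark imports are deferred into `demo` only so that the helper can be used without a Spark installation.

```python
def build_replace_expr(target_col, placeholder, value_col):
    """Build a Spark SQL expression that swaps the quoted placeholder in
    target_col for the quoted value of value_col (hypothetical helper)."""
    return f"""regexp_replace({target_col}, "'{placeholder}'", concat("'", {value_col}, "'"))"""


def demo():
    """Recreate the sample row from the question and show the rewritten query."""
    # Imported here so build_replace_expr does not require pyspark.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import expr

    spark = SparkSession.builder.master("local[1]").appName("replace-demo").getOrCreate()

    # Assumption: date is a plain string column in this sketch.
    df = spark.createDataFrame(
        [(
            "AGENT",
            "select * from table where DW_EFFECTIVE_DATE_PARTITION ='X'",
            "2019-12-24 00:00:00",
        )],
        ["table", "query", "date"],
    )

    replace_expr = build_replace_expr("query", "X", "date")
    df.withColumn("query", expr(replace_expr)).show(truncate=False)
    spark.stop()
```

Calling `demo()` on a machine with a working local Spark setup should print a table like the one shown above.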

Answer 2 (score: 0)

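from pyspark.sql.types import StringType

# Plain Python function that maps one string value to another
def replace_string(s):
  if s == "A":
    return "a"
  else:
    return "b"

# Register the function as a UDF (assumes an active SparkSession named spark)
replace_string_udf = spark.udf.register("replace_string", replace_string, StringType())
# Apply the UDF to an existing column and store the result in a new column
df = df.withColumn("new_column", replace_string_udf("old_column_name"))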
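
The snippet above shows the general UDF pattern but not the replacement asked about in the question. Here is a sketch of how it might be adapted, assuming the `query` and `date` columns from the question. `fill_placeholder` and `add_filled_query` are hypothetical names, and a Python UDF is generally slower than the built-in `regexp_replace` used in the other answers.

```python
def fill_placeholder(query, date):
    """Replace the quoted placeholder 'X' in query with the quoted date."""
    if query is None or date is None:
        return query  # leave rows with missing values untouched
    # str() also covers the case where date arrives as a datetime object.
    return query.replace("'X'", "'" + str(date) + "'")


def add_filled_query(df):
    """Return df with the query column rewritten by fill_placeholder."""
    # Imported here so fill_placeholder can be tested without pyspark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    fill_udf = udf(fill_placeholder, StringType())
    # The UDF receives the query and date values of each row.
    return df.withColumn("query", fill_udf("query", "date"))
```

Usage would be along the lines of `add_filled_query(df).show(truncate=False)`, where `df` has the columns from the question.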