I have a dataframe with two columns, address and street name.
from pyspark.sql.functions import *
import pyspark.sql
df = spark.createDataFrame([\
['108 badajoz road north ryde 2113, nsw, australia', 'north ryde'],\
['25 smart street fairfield 2165, nsw, australia', 'smart street']
],\
['address', 'street_name'])
df.show(2, False)
+------------------------------------------------+---------------+
|address |street_name |
+------------------------------------------------+---------------+
|108 badajoz road north ryde 2113, nsw, australia|north ryde |
|25 smart street fairfield 2165, nsw, australia |smart street |
+------------------------------------------------+---------------+
I want to check whether street_name is present in address and return a boolean in a new column. I can search for the pattern manually, like this:
df.withColumn("new col", col("street").rlike('.*north ryde.*')).show(20,False)
+------------------------------------------------+------------+-------+
|address                                         |street_name |new col|
+------------------------------------------------+------------+-------+
|108 badajoz road north ryde 2113, nsw, australia|north ryde  |true   |
|25 smart street fairfield 2165, nsw, australia  |smart street|false  |
+------------------------------------------------+------------+-------+
But I would like to replace the hard-coded pattern with the street_name column, something like this:
df.withColumn("new col", col("address")\
    .rlike(concat(lit('.*'), col('street_name'), lit('.*'))))\
    .show(20, False)
Answer 0 (score: 2)
You can do this simply with the contains function. See this for more details:
from pyspark.sql.functions import col, when
df = df.withColumn('new_Col',when(col('address').contains(col('street_name')),True).otherwise(False))
df.show(truncate=False)
+------------------------------------------------+------------+-------+
|address |street_name |new_Col|
+------------------------------------------------+------------+-------+
|108 badajoz road north ryde 2113, nsw, australia|north ryde |true |
|25 smart street fairfield 2165, nsw, australia |smart street|true |
+------------------------------------------------+------------+-------+
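Since contains already returns a boolean column, the when/otherwise wrapper is arguably optional; a minimal variant of the same idea (keeping the new_Col name from above) could be:
from pyspark.sql.functions import col
# contains() yields a boolean column directly, so no when/otherwise is needed
df = df.withColumn('new_Col', col('address').contains(col('street_name')))
df.show(truncate=False)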
Answer 1 (score: 1)
A simple solution is to define a UDF and use it. For example,
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def contains_address(address, street_name):
    return street_name in address

contains_address_udf = udf(contains_address, BooleanType())
df.withColumn("new_col", contains_address_udf("address", "street_name")).show()
Here you can simply use in, but if you need something more complex, just replace it with a regular expression.
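As a rough sketch of that regex variant (the re.escape guard and the contains_address_regex name are just illustrative, not part of the original answer):
import re
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def contains_address_regex(address, street_name):
    # escape the street name so regex metacharacters in the data cannot break the pattern
    return re.search(re.escape(street_name), address) is not None

contains_address_regex_udf = udf(contains_address_regex, BooleanType())
df.withColumn("new_col", contains_address_regex_udf("address", "street_name")).show(truncate=False)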
Answer 2 (score: 1)
Just use the expr function:
from pyspark.sql import functions as F
df.select(
"address",
"street_name",
F.expr("address like concat('%',street_name,'%')")
).show()
+--------------------+------------+--------------------------------------+
| address| street_name|address LIKE concat(%, street_name, %)|
+--------------------+------------+--------------------------------------+
|108 badajoz road ...| north ryde| true|
|25 smart street f...|smart street| true|
+--------------------+------------+--------------------------------------+
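If you want a friendlier column name than the auto-generated one, the same expression can presumably be wrapped in withColumn (the name new_col is just an example):
from pyspark.sql import functions as F
# same LIKE expression as above, but the result lands in an explicitly named column
df.withColumn("new_col", F.expr("address like concat('%', street_name, '%')")).show(truncate=False)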