I want to perform a regexp_replace operation on a PySpark DataFrame column using a dictionary.
Dictionary: {'RD':'ROAD','DR':'DRIVE','AVE':'AVENUE',....}
The dictionary will have around 270 key-value pairs.
Input DataFrame:
ID | Address
1 | 22, COLLINS RD
2 | 11, HEMINGWAY DR
3 | AVIATOR BUILDING
4 | 33, PARK AVE MULLOHAND DR
Desired output DataFrame:
ID | Address | Address_Clean
1 | 22, COLLINS RD | 22, COLLINS ROAD
2 | 11, HEMINGWAY DR | 11, HEMINGWAY DRIVE
3 | AVIATOR BUILDING | AVIATOR BUILDING
4 | 33, PARK AVE MULLOHAND DR | 33, PARK AVENUE MULLOHAND DRIVE
I couldn't find any documentation on this online. If I try to pass the dictionary as in the code below -
data=data.withColumn('Address_Clean',regexp_replace('Address',dict))
it throws the error "regexp_replace takes 3 arguments, 2 given".
The dataset has around 20 million rows, so a UDF-based solution would be slow (it operates row by row), and we don't have access to Spark 2.3.0, which supports pandas_udf. Other than possibly using a loop, is there an efficient way to do this?
Answer 0 (score: 0)
What is tripping you up is that regexp_replace() requires three arguments:
regexp_replace('column_to_change','pattern_to_be_changed','new_pattern')
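For example, a single hard-coded replacement with all three arguments would look like this (a minimal sketch; the column name Address comes from the question, and the \b word boundaries are my addition to keep 'RD' from matching inside other words):

import pyspark.sql.functions as sf

# pattern and replacement are both required
data = data.withColumn('Address_Clean', sf.regexp_replace('Address', r'\bRD\b', 'ROAD'))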
But you are right that you don't need a UDF or a loop here. All you need is a bit more regexp and a directory table that looks just like your original dictionary :)
Here is my solution:
import pyspark.sql.functions as sf

# First, strip out every ending you want to replace.
# You can use the OR (|) operator inside the pattern for that.
# You could probably build that pattern string automatically from your dictionary keys,
# but I will leave that up to you.
input_df = input_df.withColumn('start_address', sf.regexp_replace('original_address', 'RD|DR|etc...', ''))

# You still need the old ending in a separate column,
# so that you have something to join against the directory table.
input_df = input_df.withColumn('end_of_address', sf.regexp_extract('original_address', '(.*) (.*)', 2))

# Now join the directory table, which has two columns:
# the endings you want to replace and the endings you want to have instead.
input_df = directory_df.join(input_df, 'end_of_address')

# Finally, concatenate the stripped address with the correct ending.
input_df = input_df.withColumn('address_clean', sf.concat('start_address', 'correct_end'))
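If it helps, here is a sketch of how the directory table itself could be built straight from the original dictionary (this assumes an active SparkSession named spark; the column names match the join above, and only the question's sample pairs are shown, not all ~270):

# Build directory_df with one row per abbreviation -> full word pair.
abbrev_map = {'RD': 'ROAD', 'DR': 'DRIVE', 'AVE': 'AVENUE'}  # add the remaining pairs here
directory_df = spark.createDataFrame(
    [(abbrev, full) for abbrev, full in abbrev_map.items()],
    ['end_of_address', 'correct_end']
)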