String matching between two columns using Levenshtein distance in PySpark

Time: 2019-09-05 12:36:18

Tags: python dataframe apache-spark pyspark levenshtein-distance

I am trying to compare pairs of names by converting the Levenshtein distance between each pair into a match coefficient:

coef = 1 - Levenshtein(str1, str2) / max(length(str1), length(str2))
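For reference, the formula above can be sketched in plain Python, without Spark. The function names `levenshtein_distance` and `match_coefficient` are hypothetical, and returning 1.0 for two empty strings is an assumption made here to avoid dividing by zero.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def match_coefficient(a: str, b: str) -> float:
    """1 - distance / length of the longer string, as in the formula above."""
    longest = max(len(a), len(b))
    if longest == 0:
        # Assumption: two empty strings count as a perfect match.
        return 1.0
    return 1 - levenshtein_distance(a, b) / longest


if __name__ == "__main__":
    # "kitten" -> "sitting" needs 3 edits and the longer string has 7 characters.
    print(match_coefficient("kitten", "sitting"))
```

Here the built-in `max` works because it compares two plain Python integers, not Spark columns.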

However, when I implement this in PySpark with withColumn(), I get an error when the max() function is evaluated. Both numpy.max and pyspark.sql.functions.max raise errors. Any ideas?

from pyspark.sql.functions import col, length, levenshtein

valuesA = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
TableA = spark.createDataFrame(valuesA,['firstname','id'])

test_compare = TableA.withColumnRenamed('firstname', 'firstname2').withColumnRenamed('id', 'id2').crossJoin(TableA)
test_compare.withColumn("distance_firstname", levenshtein('firstname', 'firstname2') / max(length(col('firstname')), length(col('firstname2'))))

1 Answer:

Answer 0: (score: 0)

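max is an aggregate function: it reduces a whole column to a single value. To find the larger of two values within each row, use greatest, which is also available in pyspark.sql.functions.
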
from pyspark.sql.functions import col, greatest, length, levenshtein

# Assumes an active SparkSession named `spark`, as in the pyspark shell.
valuesA = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
TableA = spark.createDataFrame(valuesA,['firstname','id'])

# Pair every name with every name via a cross join.
test_compare = TableA.withColumnRenamed('firstname', 'firstname2').withColumnRenamed('id', 'id2').crossJoin(TableA)
# greatest() compares the two lengths row by row, unlike the aggregate max().
test_compare.withColumn("distance_firstname", levenshtein('firstname', 'firstname2') / greatest(length(col('firstname')), length(col('firstname2')))).show()