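I am trying to compare pairs of names by converting the Levenshtein distance between them into a matching coefficient:
coef = 1 - Levenshtein(str1, str2) / max(length(str1), length(str2))
However, when I implement this in PySpark with withColumn(), I get an error when the max() function is evaluated. Both numpy.max and pyspark.sql.functions.max raise an error. Any ideas?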
from pyspark.sql.functions import col, length, levenshtein
valuesA = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
TableA = spark.createDataFrame(valuesA,['firstname','id'])
test_compare = TableA.withColumnRenamed('firstname', 'firstname2').withColumnRenamed('id', 'id2').crossJoin(TableA)
test_compare.withColumn("distance_firstname", levenshtein('firstname', 'firstname2') / max(length(col('firstname')), length(col('firstname2'))))
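Answer 0 (score: 0)
max is an aggregate function: it reduces a single column across rows. To get the larger of two values within the same row, use greatest from pyspark.sql.functions instead.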
from pyspark.sql.functions import col, greatest, length, levenshtein
valuesA = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
TableA = spark.createDataFrame(valuesA,['firstname','id'])
test_compare = TableA.withColumnRenamed('firstname', 'firstname2').withColumnRenamed('id', 'id2').crossJoin(TableA)
test_compare.withColumn("distance_firstname", levenshtein('firstname', 'firstname2') / greatest(length(col('firstname')), length(col('firstname2')))).show()
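Note that the snippet above computes only the ratio levenshtein / greatest(...); subtracting it from 1 gives the coefficient from the question. To sanity-check a few of the values Spark returns, the same formula can be sketched in plain Python. The helper names levenshtein_distance and match_coefficient are hypothetical, and returning 1.0 for two empty strings is an assumption made here to avoid dividing by zero, not something the question specifies.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute), row by row."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def match_coefficient(a: str, b: str) -> float:
    """coef = 1 - Levenshtein(a, b) / max(len(a), len(b))."""
    longest = max(len(a), len(b))  # built-in max is fine on plain Python ints
    if longest == 0:
        return 1.0  # assumption: two empty strings count as a perfect match
    return 1 - levenshtein_distance(a, b) / longest


if __name__ == "__main__":
    names = ["Pirate", "Monkey", "Ninja", "Spaghetti"]
    for first in names:
        for second in names:
            print(first, second, match_coefficient(first, second))
```

The built-in max works here because it compares two ordinary integers; in the DataFrame version the operands are Column expressions, which is why greatest is needed there.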