Question

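I have some data in which column 'X' contains strings. I am writing a function, using pyspark, that takes a search_word and filters out every row whose column 'X' string does not contain search_word as a substring. The function must also tolerate misspellings of the word, i.e. fuzzy matching. I have loaded the data into a pyspark DataFrame and written a function, using the NLTK and fuzzywuzzy Python libraries, that returns True or False depending on whether a string contains search_word.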

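My problem is that I cannot map the function onto the DataFrame correctly. Am I approaching this the wrong way? Should I be trying to do the fuzzy matching through some kind of SQL query, or by using RDDs?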

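I am new to pyspark, so I suspect this has been answered before, but I cannot find an answer anywhere. I have never done any NLP with SQL, and I have never heard of SQL being able to fuzzy-match substrings.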

Update #1

The function works like this:

wf = WordFinder(search_word='some_substring')
result1 = wf.find_word_in_string(string_to_search='string containing some_substring or misspelled some_sibstrung')
result2 = wf.find_word_in_string(string_to_search='string not containing the substring')

result1 is True

result2 is False

Answer 1

A simple approach is to use the built-in levenshtein function. For example,

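from pyspark.sql import functions as func

# Assumes an active SparkSession named `spark`.
(
    spark.createDataFrame([("apple",), ("aple",), ("orange",), ("pear",)], ["fruit"])
    .withColumn("substring", func.lit("apple"))
    # Edit distance between each fruit and the search term
    .withColumn("levenstein", func.levenshtein("fruit", "substring"))
    .filter("levenstein <= 1")
    .toPandas()
)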

returns

   fruit substring  levenstein
0  apple     apple           0
1   aple     apple           1
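To make the distance values in that output concrete, here is a minimal pure-Python sketch of the edit distance being computed. The `levenshtein` helper below is illustrative only; it is not Spark's implementation, just the textbook dynamic-programming definition.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    # previous[j] is the distance between the prefix of `a` processed
    # so far and the prefix b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


for fruit in ["apple", "aple", "orange", "pear"]:
    print(fruit, levenshtein(fruit, "apple"))
```

Note that the distance is taken over the whole string, so on its own it matches near-identical values rather than a misspelled word buried inside a longer sentence.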

If you want to use a vanilla Python function, such as something from the NLTK package, you have to define a UDF that accepts a string and returns a boolean.
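A rough sketch of that UDF route might look like the following. Everything named here is an assumption made for illustration: `fuzzy_contains`, `filter_fuzzy`, and the 0.8 threshold are hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for the asker's NLTK/fuzzywuzzy logic so the example has no extra dependencies. It also assumes the search word is a single whitespace-delimited token.

```python
from difflib import SequenceMatcher


def fuzzy_contains(text, search_word, threshold=0.8):
    """Return True if any whitespace-separated token in `text` is
    similar enough to `search_word` (similarity ratio >= threshold)."""
    if text is None:  # Spark passes null column values as None
        return False
    target = search_word.lower()
    return any(
        SequenceMatcher(None, token, target).ratio() >= threshold
        for token in text.lower().split()
    )


def filter_fuzzy(df, column, search_word, threshold=0.8):
    """Keep only the rows of `df` whose `column` fuzzily contains `search_word`."""
    # pyspark is imported here so the helper above stays usable without Spark
    from pyspark.sql import functions as func
    from pyspark.sql.types import BooleanType

    match_udf = func.udf(
        lambda s: fuzzy_contains(s, search_word, threshold), BooleanType()
    )
    return df.filter(match_udf(func.col(column)))
```

With a DataFrame `df` that has a string column 'X', `filter_fuzzy(df, "X", "some_substring")` would return the filtered DataFrame. Python UDFs run row by row outside the JVM, so this is likely to be slower than the built-in function, but it lets you plug in any matching logic you already have.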

Fuzzy matching strings in a pyspark DataFrame
