Fuzzy matching a word in the strings of a PySpark DataFrame

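I have some data in which column 'X' contains strings. I am writing a function in PySpark that takes a search_word and filters out every row whose 'X' string does not contain search_word as a substring. The function must also tolerate misspellings of the word, i.e. match fuzzily. I have loaded the data into a PySpark DataFrame and written a function, using the NLTK and fuzzywuzzy Python libraries, that returns True or False depending on whether a string contains search_word.

My problem is that I cannot map the function onto the DataFrame correctly. Am I approaching this the wrong way? Should I attempt the fuzzy matching through some kind of SQL query, or use an RDD?

I am new to PySpark, so I suspect this has been answered before, but I could not find an answer anywhere. I have never done any NLP in SQL, and I have never heard of SQL being able to fuzzy match substrings.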

Update #1

The function is used as follows:

wf = WordFinder(search_word='some_substring')
result1 = wf.find_word_in_string(string_to_search='string containing some_substring or misspelled some_sibstrung')
result2 = wf.find_word_in_string(string_to_search='string not containing the substring')

result1 is True

result2 is False

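1 Answer

A simple approach is to use the built-in levenshtein function, which computes the edit distance between two string columns. For example: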

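from pyspark.sql import functions as func

# Assumes an active SparkSession named `spark`.
(
    spark.createDataFrame([("apple",), ("aple",), ("orange",), ("pear",)], ["fruit"])
    .withColumn("substring", func.lit("apple"))  # the word to match against
    .withColumn("levenstein", func.levenshtein("fruit", "substring"))  # edit distance per row
    .filter("levenstein <= 1")  # keep rows within one edit of the word
    .toPandas()
)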

returns

   fruit substring  levenstein
0  apple     apple           0
1   aple     apple           1
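
For readers unfamiliar with the metric, here is a minimal plain-Python sketch of Levenshtein (edit) distance, applied to the same four fruit names. It is an illustration of the idea only, not Spark's implementation; the helper name `levenshtein` and the list `fruits` are introduced here for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute (or keep) the char
                )
            )
        previous = current
    return previous[-1]


fruits = ["apple", "aple", "orange", "pear"]
# Same threshold as the filter above: keep values within one edit of "apple".
kept = [f for f in fruits if levenshtein(f, "apple") <= 1]
```

Note that the distance is computed between whole strings, so on a column of full sentences it would compare the entire sentence to the search word rather than look for the word inside it.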

If you want to use a plain Python function, such as something from the NLTK package, you have to define a UDF that accepts a string and returns a boolean.
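
The UDF route might look like the sketch below. It is only an illustration under stated assumptions: the asker's WordFinder class is not shown, so a hypothetical stand-in predicate `contains_fuzzy` built on the standard library's difflib is used instead of NLTK or fuzzywuzzy, and the helper `filter_fuzzy`, the 0.8 threshold, and whitespace tokenization are all choices made for this example.

```python
from difflib import SequenceMatcher


def contains_fuzzy(text, search_word, threshold=0.8):
    """Return True if any whitespace-separated token of `text` is
    similar enough to `search_word` (hypothetical stand-in predicate)."""
    if text is None:  # Spark passes NULL column values as None
        return False
    target = search_word.lower()
    for token in text.lower().split():
        if SequenceMatcher(None, token, target).ratio() >= threshold:
            return True
    return False


def filter_fuzzy(df, column, search_word, threshold=0.8):
    """Keep only the rows of `df` whose `column` fuzzily contains `search_word`."""
    # Imported here so the pure-Python predicate above can be used
    # and tested without a Spark installation.
    from pyspark.sql import functions as func
    from pyspark.sql.types import BooleanType

    # Wrap the Python predicate as a UDF that maps a string to a boolean.
    matches = func.udf(
        lambda s: contains_fuzzy(s, search_word, threshold), BooleanType()
    )
    return df.filter(matches(func.col(column)))
```

With a DataFrame `df` holding the strings in column 'X', a call such as `filter_fuzzy(df, "X", "some_substring")` would return the filtered DataFrame. Python UDFs run row by row outside the JVM, so they are generally slower than built-in functions like levenshtein.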

Answered on 2024-04-27T00:59:58+08:00
