Question

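I am reading data from two Hive tables. The tokens table holds the tokens that need to be matched against the input data. The input data contains a description column along with other columns. I need to split each description into words and compare every word with every entry in the tokens table. At the moment I am using the me.xdrop.fuzzywuzzy.FuzzySearch library for fuzzy matching.

Here is my code snippet: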

val tokens = sqlContext.sql("select token from tokens")
val desc = sqlContext.sql("select description from desceriptiontable")
val desc_tokens = desc.flatMap(_.toString().split(" "))

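Now I need to iterate over desc_tokens. Each element of desc_tokens should be fuzzy matched against every element of tokens, and if the match exceeds 85% I need to replace the element in desc_tokens with the matching element from tokens.

Example:

My token list is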

hello
this
is
token
file
sample

My input description is

helo this is input desc sampl

The code should return

hello this is input desc sample

hello and helo fuzzy match at > 85%, so helo is replaced by hello. The same applies to sampl.
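The rule described above can be sketched in plain Scala, without Spark. This is only an illustration under assumptions: the names `FuzzyReplaceSketch`, `lcsLength`, `similarity` and `normalize` are made up for this sketch, and the score (twice the longest common subsequence divided by the combined length) is a dependency-free stand-in, not the implementation used by me.xdrop.fuzzywuzzy.FuzzySearch.

```scala
object FuzzyReplaceSketch {
  // Length of the longest common subsequence of a and b (dynamic programming).
  def lcsLength(a: String, b: String): Int = {
    val dp = Array.ofDim[Int](a.length + 1, b.length + 1)
    for (i <- 1 to a.length; j <- 1 to b.length) {
      dp(i)(j) =
        if (a(i - 1) == b(j - 1)) dp(i - 1)(j - 1) + 1
        else math.max(dp(i - 1)(j), dp(i)(j - 1))
    }
    dp(a.length)(b.length)
  }

  // Stand-in similarity in [0, 1]: 2 * LCS / (len(a) + len(b)).
  // Assumption: used here only so the sketch runs without a library.
  def similarity(a: String, b: String): Double =
    if (a.isEmpty && b.isEmpty) 1.0
    else 2.0 * lcsLength(a, b) / (a.length + b.length)

  // Replace each word by its best-scoring token when the score exceeds the threshold.
  def normalize(description: String, tokens: Seq[String], threshold: Double = 0.85): String =
    description.split(" ").map { word =>
      if (tokens.isEmpty) word
      else {
        val best = tokens.maxBy(similarity(word, _))
        if (similarity(word, best) > threshold) best else word
      }
    }.mkString(" ")

  def main(args: Array[String]): Unit = {
    val tokens = Seq("hello", "this", "is", "token", "file", "sample")
    // prints: hello this is input desc sample
    println(normalize("helo this is input desc sampl", tokens))
  }
}
```

With this stand-in score, helo against hello gives 8/9 (about 0.89) and sampl against sample gives 10/11 (about 0.91), so both clear the 0.85 threshold, while input and desc stay unchanged.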

Answer 1

I tested with this library: https://github.com/rockymadden/stringmetric

Another idea (not optimized):

import com.rockymadden.stringmetric.similarity.JaroMetric

// The token order is changed on purpose
val tokens = Array("this", "is", "sample", "token", "file", "hello")
val desc_tokens = Array("helo", "this", "is", "token", "file", "sampl")

val res = desc_tokens.map { str =>
  // Score str against every token (compare returns an Option; a missing score counts as 0)
  val elem = tokens.zipWithIndex.map { case (tok, index) =>
    (tok, index, JaroMetric.compare(str, tok).getOrElse(0.0))
  }
  // Pick the token with the highest score
  val emax = elem.maxBy { case (_, _, score) => score }
  // Use that token if its score exceeds 0.85, otherwise keep the input word
  if (emax._3 > 0.85) tokens(emax._2) else str
}
res.foreach(println)

My output: hello this is token file sample
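Since the approach above is described as not optimized, here is one possible refinement, offered as a sketch rather than a definitive version. It skips scoring for words that already appear in the token list, and it takes the scorer as a parameter so that a different metric can be plugged in. The names `TokenNormalizer`, `build` and `bigramDice` are hypothetical, and the bigram Dice coefficient is only a self-contained stand-in for a library metric.

```scala
object TokenNormalizer {
  // Returns a function that rewrites a description word by word.
  // Assumption: `score` is any similarity function with results in [0, 1].
  def build(tokens: Seq[String],
            score: (String, String) => Double,
            threshold: Double = 0.85): String => String = {
    val exact = tokens.toSet
    def replaceWord(word: String): String =
      if (tokens.isEmpty || exact.contains(word)) word // exact hits need no scoring
      else {
        val (best, bestScore) = tokens.map(t => (t, score(word, t))).maxBy(_._2)
        if (bestScore > threshold) best else word
      }
    (description: String) => description.split(" ").map(replaceWord).mkString(" ")
  }

  // Stand-in scorer for the demo: Dice coefficient over sets of character bigrams.
  def bigramDice(a: String, b: String): Double = {
    val x = a.sliding(2).filter(_.length == 2).toSet
    val y = b.sliding(2).filter(_.length == 2).toSet
    if (x.isEmpty || y.isEmpty) 0.0
    else 2.0 * (x intersect y).size / (x.size + y.size)
  }

  def main(args: Array[String]): Unit = {
    val tokens = Seq("this", "is", "sample", "token", "file", "hello")
    val normalize = build(tokens, bigramDice)
    // prints: hello this is token file sample
    println(normalize("helo this is token file sampl"))
  }
}
```

To score with JaroMetric instead, the scorer could be `(a, b) => JaroMetric.compare(a, b).getOrElse(0.0)`. With FuzzySearch, whose `ratio` returns an integer from 0 to 100, it could be `(a, b) => FuzzySearch.ratio(a, b) / 100.0`. In Spark, assuming the tokens table is small enough to collect on the driver, the token array could be shared with `sc.broadcast` and the returned function applied to each description inside something like `desc.rdd.map(row => normalize(row.getString(0)))`, which keeps every description on its own row.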

Fuzzy comparison of two Hive columns using Apache Spark with Scala
