How to get accurate JOINs using fuzzy matching in Oracle

Time: 2017-04-28 10:13:31

Tags: sql oracle fuzzy-comparison jaro-winkler

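I am trying to join a set of county names from one table with the county names in another table. The problem is that the county names are not standardized in either table. The two tables do not contain the same number of counties; moreover, the names may not always follow a similar pattern. For example, the county "SAINT JOHNS" in "Table A" may appear as "ST JOHNS" in "Table B". We cannot predict a common pattern for them.

This means we cannot use an equality (=) condition when joining. So I am trying to join them using the JARO_WINKLER_SIMILARITY function in Oracle. My left outer join condition looks like this: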

Table_A.State = Table_B.State 
AND UTL_MATCH.JARO_WINKLER_SIMILARITY(Table_A.County_Name,Table_B.County_Name)>=80

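After some testing on the results I settled on a threshold of 80, which seems to be optimal. The problem is that I get a set of "false positives" when joining. For example, if some counties in the same state have similar names (e.g. "BARRY" and "BAY"), they will be matched whenever the measure is >= 80. This produces inaccurate joined data. Can anyone suggest a workaround?

Thanks, DAV

2 answers:

Answer 0: (score: 2)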

  

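"Can you help me build a query that, for each record in Table B/C/D, queries Table_A and matches the county name in A with the highest-ranked similarity that is >= 80?"

Oracle Setup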

CREATE TABLE official_words ( word ) AS
  SELECT 'SAINT JOHNS' FROM DUAL UNION ALL
  SELECT 'MONTGOMERY' FROM DUAL UNION ALL
  SELECT 'MONROE' FROM DUAL UNION ALL
  SELECT 'SAINT JAMES' FROM DUAL UNION ALL
  SELECT 'BOTANY BAY' FROM DUAL;

CREATE TABLE words_to_match ( word ) AS
  SELECT 'SAINT JOHN' FROM DUAL UNION ALL
  SELECT 'ST JAMES' FROM DUAL UNION ALL
  SELECT 'MONTGOMERY BAY' FROM DUAL UNION ALL
  SELECT 'MONROE ST' FROM DUAL;

Query

SELECT *
FROM   (
  SELECT wtm.word,
         ow.word AS official_word,
         UTL_MATCH.JARO_WINKLER_SIMILARITY( wtm.word, ow.word ) AS similarity,
         ROW_NUMBER() OVER ( PARTITION BY wtm.word ORDER BY UTL_MATCH.JARO_WINKLER_SIMILARITY( wtm.word, ow.word ) DESC ) AS rn
  FROM   words_to_match wtm
         INNER JOIN
         official_words ow
         ON ( UTL_MATCH.JARO_WINKLER_SIMILARITY( wtm.word, ow.word )>=80 )
)
WHERE rn = 1;

Output

WORD           OFFICIAL_WO SIMILARITY         RN
-------------- ----------- ---------- ----------
MONROE ST      MONROE              93          1
MONTGOMERY BAY MONTGOMERY          94          1
SAINT JOHN     SAINT JOHNS         98          1
ST JAMES       SAINT JAMES         80          1
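
The same pattern can be carried over to the tables from the question. The following is only a sketch: it assumes Table_A and Table_B each have State and County_Name columns (as in the question's join condition) and that (State, County_Name) is unique in Table_A. A left outer join is used so that counties in Table_A with no candidate at or above the threshold are still returned, with a NULL match.

```sql
-- Sketch only: table and column names are taken from the question,
-- and the threshold of 80 is the one the asker settled on.
SELECT state, county_name, matched_name, similarity
FROM   (
  SELECT a.State       AS state,
         a.County_Name AS county_name,
         b.County_Name AS matched_name,
         UTL_MATCH.JARO_WINKLER_SIMILARITY( a.County_Name, b.County_Name ) AS similarity,
         -- Number the candidates for each Table_A county, best score first
         ROW_NUMBER() OVER (
           PARTITION BY a.State, a.County_Name
           ORDER BY UTL_MATCH.JARO_WINKLER_SIMILARITY( a.County_Name, b.County_Name ) DESC
         ) AS rn
  FROM   Table_A a
         LEFT OUTER JOIN
         Table_B b
         ON (     b.State = a.State
              AND UTL_MATCH.JARO_WINKLER_SIMILARITY( a.County_Name, b.County_Name ) >= 80 )
)
WHERE  rn = 1;  -- keep only the best candidate (or the single unmatched row)
```

Keeping only the top-ranked candidate per county means a weaker match in the same state is dropped whenever a stronger one exists, although a weak match can still win if it is the only candidate above the threshold.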

Answer 1: (score: 0)

Using some made-up test data inline (you would use your own TABLE_A and TABLE_B in place of the first two with clauses, and start at with matches as ...):

with table_a (state, county_name) as
     ( select 'A', 'ST JOHNS' from dual union all
       select 'A', 'BARRY' from dual union all
       select 'B', 'CHEESECAKE' from dual union all
       select 'B', 'WAFFLES' from dual union all
       select 'C', 'UMBRELLAS' from dual )
   , table_b (state, county_name) as
     ( select 'A', 'SAINT JOHNS' from dual union all
       select 'A', 'SAINT JOANS' from dual union all
       select 'A', 'BARRY' from dual union all
       select 'A', 'BARRIERS' from dual union all
       select 'A', 'BANANA' from dual union all
       select 'A', 'BANOFFEE' from dual union all
       select 'B', 'CHEESE' from dual union all
       select 'B', 'CHIPS' from dual union all
       select 'B', 'CHICKENS' from dual union all
       select 'B', 'WAFFLING' from dual union all
       select 'B', 'KITTENS' from dual union all
       select 'C', 'PUPPIES' from dual union all
       select 'C', 'UMBRIA' from dual union all
       select 'C', 'UMBRELLAS' from dual )
   , matches as
     ( select a.state, a.county_name, b.county_name as matched_name
            , utl_match.jaro_winkler_similarity(a.county_name,b.county_name) as score
       from   table_a a
              join table_b b on b.state = a.state  )
   , ranked_matches as
     ( select m.*
            , rank() over (partition by m.state, m.county_name order by m.score desc) as ranking
       from   matches m
       where  score > 50 )
select rm.state, rm.county_name, rm.matched_name, rm.score
from   ranked_matches rm
where  ranking = 1
order by 1,2;

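Results:

STATE COUNTY_NAME MATCHED_NAME      SCORE
----- ----------- ------------ ----------
A     BARRY       BARRY               100
A     ST JOHNS    SAINT JOHNS          80
B     CHEESECAKE  CHEESE               92
B     WAFFLES     WAFFLING             86
C     UMBRELLAS   UMBRELLAS           100

The idea is that matches computes all the scores, ranked_matches assigns them a sequence within (state, county_name), and the final query selects all the top scorers (i.e. filters on ranking = 1).

You may still get some duplicates, since nothing prevents two different fuzzy matches from getting the same score.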
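
If exactly one match per county is required, one option is to replace rank() with row_number() and add a tie-breaker to the ordering. The sketch below is an illustration rather than part of the original answer: the tie-breaker (alphabetical order of the matched name) is an arbitrary assumption, the threshold of 80 is borrowed from the question, and the inline test data is a cut-down version of the data above.

```sql
with table_a (state, county_name) as
     ( select 'A', 'ST JOHNS' from dual union all
       select 'A', 'BARRY' from dual )
   , table_b (state, county_name) as
     ( select 'A', 'SAINT JOHNS' from dual union all
       select 'A', 'SAINT JOANS' from dual union all
       select 'A', 'BARRY' from dual union all
       select 'A', 'BARRIERS' from dual )
   , matches as
     ( select a.state, a.county_name, b.county_name as matched_name
            , utl_match.jaro_winkler_similarity(a.county_name, b.county_name) as score
       from   table_a a
              join table_b b on b.state = a.state )
   , ranked_matches as
     ( select m.*
              -- row_number() never assigns the same number twice within a partition;
              -- matched_name is an arbitrary (assumed) tie-breaker for equal scores
            , row_number() over ( partition by m.state, m.county_name
                                  order by m.score desc, m.matched_name ) as rn
       from   matches m
       where  m.score >= 80 )
select rm.state, rm.county_name, rm.matched_name, rm.score
from   ranked_matches rm
where  rm.rn = 1
order by rm.state, rm.county_name;
```

A tie-breaker only makes the result deterministic; it does not make the chosen match more likely to be correct, so tied candidates may still deserve manual review.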