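Partial matching in an Oracle database

Asked: 2011-01-27 05:48:25

Tags: sql oracle

I have a very large table (more than a million rows) that holds product names and prices from different sources.

Many products have the same name but different prices.

Here is the problem:

The same product appears in several rows, but its name is not written the same way each time, for example: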

Row    Product name             Price
-----  -----------------------  -----
Row 1  XYZ - size information   $a
Row 2  XYZ -Brand information   $b
Row 3  xyz                      $c

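I want to find all products that are listed with different prices. If the names were identical across rows, I could simply do a self-join, such as Table1.Product_Name = Table2.Product_Name and Table1.Price != Table2.Price.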
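For reference, the self-join for the exact-name case could be sketched as follows. The table name products and the column names product_name and price are placeholders for illustration, not the actual schema.

```sql
-- Sketch only: "products", "product_name" and "price" are assumed names.
-- Pairs up rows that share exactly the same name but carry different prices.
select
  t1.product_name,
  t1.price  price_1,
  t2.price  price_2
from
  products t1
  join products t2
    on  t1.product_name  = t2.product_name
    and t1.price        != t2.price;
```

With != every pair is reported twice, once in each order. Comparing with < instead would list each pair only once.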

But that does not work in this case :(

Can anyone suggest a solution?

1 answer:

Answer 0 (score: 3)

You could try regexp_replace to get started in the right direction:

create table tq84_products (
  name   varchar2(50),
  price  varchar2( 5)
);

Three products:

  • XYZ
  • ABCD
  • efghi

ABCD has two records with the same price, while all other records have different prices.

insert into tq84_products values (' XYZ - size information', '$a');
insert into tq84_products values ('XYZ - brand information', '$b');
insert into tq84_products values ('xyz'                    , '$c');

insert into tq84_products values ('Product ABCD'           , '$d');
insert into tq84_products values ('Abcd is the best'       , '$d');

insert into tq84_products values ('efghi is cheap'         , '$f');
insert into tq84_products values ('no, efghi is expensive' , '$g');

A select statement with stop words, which filter out words that commonly occur in product names:

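-- split_into_words: one row per (name, word position 1..9); word is the
-- upper-cased Nth word of the name, or null if the name has fewer words.
with split_into_words as (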
      select 
        name,
        price,
        upper (
        regexp_replace(name,
                             '\W*'  ||
                       '(\w+)?\W?+' ||
                       '(\w+)?\W?+' ||
                       '(\w+)?\W?+' ||
                       '(\w+)?\W?+' ||
                       '(\w+)?\W?+' ||
                       '(\w+)?\W?+' ||
                       '(\w+)?\W?+' ||
                       '(\w+)?\W?+' ||
                       '(\w+)?'     ||
                       '.*',
                       '\' || submatch.counter
                     ) 
        )                          word
         from
           tq84_products,
           (select
              rownum counter
            from 
              dual
            connect by
              level < 10
           ) submatch
  ),
  stop_words as (
    select 'IS'          word from dual union all
    select 'BRAND'       word from dual union all
    select 'INFORMATION' word from dual 
  )
  select
    w1.price,
    w2.price,
    w1.name,  
    w2.name
--  substr(w1.word, 1, 30)               common_word,
--  count(*) over (partition by w1.name) cnt
  from
    split_into_words w1,
    split_into_words w2
  where
    w1.word   = w2.word and
    w1.name  <  w2.name and
    w1.word is not null and
    w2.word is not null and
    w1.word not in (select word from stop_words) and
    w2.word not in (select word from stop_words) and
    w1.price != w2.price;

This returns:

$a    $b     XYZ - size information                            XYZ - brand information
$b    $c    XYZ - brand information                            xyz
$a    $c     XYZ - size information                            xyz
$f    $g    efghi is cheap                                     no, efghi is expensive

So ABCD is not returned, while the others are.
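To inspect which words are extracted from each name, a query along the following lines may help. It is only a sketch: it swaps in regexp_substr for the regexp_replace construction used above, and it keeps the same assumption of at most 9 words per name. The counters subquery name is made up for this example.

```sql
-- Sketch: list the upper-cased words of each product name, one per row.
-- Assumes the tq84_products table created above and at most 9 words per name.
with counters as (
  select level counter
  from   dual
  connect by level <= 9
)
select
  p.name,
  c.counter,
  upper(regexp_substr(p.name, '\w+', 1, c.counter)) word
from
  tq84_products p
  cross join counters c
where
  -- regexp_substr returns null when the name has fewer than "counter" words
  regexp_substr(p.name, '\w+', 1, c.counter) is not null
order by
  p.name,
  c.counter;
```

For 'no, efghi is expensive' this lists NO, EFGHI, IS and EXPENSIVE at positions 1 to 4, which makes it easier to decide which words belong in stop_words.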