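I have a very large table (more than a million rows) containing product names and prices gathered from different sources.
Many products share the same name but have different prices.
Here is the problem:
the same product appears many times, but its name is not written identically in every row, for example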
Row    Product name               Price
-----  -------------------------  -----
Row 1  XYZ - size information     $a
Row 2  XYZ -Brand information     $b
Row 3  xyz                        $c
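I want to find all products that are listed with different prices. If the names were identical across rows, I could easily do a self-join, e.g. Table1.Product_Name = Table2.Product_Name and Table1.Price != Table2.Price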
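For reference, the exact-match self-join described here might look like the sketch below. The table name `products`, its columns, and the sample rows are assumptions made for illustration, not the real schema.

```sql
-- Hypothetical sample data standing in for the real table.
with products as (
  select 'XYZ - size information' product_name, '$a' price from dual union all
  select 'xyz'                    product_name, '$c' price from dual union all
  select 'xyz'                    product_name, '$d' price from dual
)
select
  t1.product_name,
  t1.price price_1,
  t2.price price_2
from
  products t1
  join products t2
    on  t1.product_name = t2.product_name   -- names must match exactly
    and t1.price       != t2.price;         -- but the prices differ
```

Because the join compares names for exact equality, a row such as 'XYZ - size information' never pairs with 'xyz'. Using `!=` also reports every pair twice (once in each order); `t1.price < t2.price` would report each pair once.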
But that does not work in this case :(
Can anyone suggest a solution?
Answer 0 (score: 3)
You could try regexp_replace
as a step in the right direction:
create table tq84_products (
name varchar2(50),
price varchar2( 5)
);
Three products (XYZ, ABCD and EFGHI):
ABCD has two records with the same price, while all other records have different prices.
insert into tq84_products values (' XYZ - size information', '$a');
insert into tq84_products values ('XYZ - brand information', '$b');
insert into tq84_products values ('xyz' , '$c');
insert into tq84_products values ('Product ABCD' , '$d');
insert into tq84_products values ('Abcd is the best' , '$d');
insert into tq84_products values ('efghi is cheap' , '$f');
insert into tq84_products values ('no, efghi is expensive' , '$g');
A select statement with a list of stop words, used to ignore words that commonly appear in product names:
with split_into_words as (
  select
    name,
    price,
    upper (
      regexp_replace(name,
                     '\W*'         ||
                     '(\w+)?\W?+'  ||
                     '(\w+)?\W?+'  ||
                     '(\w+)?\W?+'  ||
                     '(\w+)?\W?+'  ||
                     '(\w+)?\W?+'  ||
                     '(\w+)?\W?+'  ||
                     '(\w+)?\W?+'  ||
                     '(\w+)?\W?+'  ||
                     '(\w+)?'      ||
                     '.*',
                     '\' || submatch.counter
      )
    ) word
  from
    tq84_products,
    (select
       rownum counter
     from
       dual
     connect by
       level < 10
    ) submatch
),
stop_words as (
  select 'IS'          word from dual union all
  select 'BRAND'       word from dual union all
  select 'INFORMATION' word from dual
)
select
  w1.price,
  w2.price,
  w1.name,
  w2.name
  -- substr(w1.word, 1, 30) common_word,
  -- count(*) over (partition by w1.name) cnt
from
  split_into_words w1,
  split_into_words w2
where
  w1.word  = w2.word                          and
  w1.name  < w2.name                          and
  w1.word is not null                         and
  w2.word is not null                         and
  w1.word not in (select word from stop_words) and
  w2.word not in (select word from stop_words) and
  w1.price != w2.price;
This then selects:
$a $b XYZ - size information XYZ - brand information
$b $c XYZ - brand information xyz
$a $c XYZ - size information xyz
$f $g efghi is cheap no, efghi is expensive
So abcd is not returned, while the others are.
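As a possible variation on the word-splitting step, the words can also be pulled out one at a time with regexp_substr and its occurrence argument, instead of nine capture groups in regexp_replace. This is only a sketch of a different technique, not the query used above; the inline sample rows and the `counters` helper are assumptions made for illustration, and like the original it only looks at the first nine words of each name.

```sql
-- Hypothetical sample rows; in practice this would be tq84_products.
with products as (
  select 'XYZ - size information' name, '$a' price from dual union all
  select 'Abcd is the best'       name, '$d' price from dual
),
counters as (
  -- Generates the numbers 1 through 9, one per row.
  select level counter from dual connect by level <= 9
)
select
  p.name,
  p.price,
  c.counter,
  -- The n-th run of word characters in the name, upper-cased.
  upper(regexp_substr(p.name, '\w+', 1, c.counter)) word
from
  products p
  cross join counters c
where
  -- regexp_substr returns null when the name has fewer than n words.
  regexp_substr(p.name, '\w+', 1, c.counter) is not null
order by
  p.name,
  c.counter;
```

The result has the same shape as split_into_words (name, price, word), so the stop-word filter and the self-join on word could be applied to it unchanged.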