Trying to create six segmented "groups" of data from a single table (SQL)

Time: 2018-09-28 02:41:00

Tags: sql-server tsql

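What I am trying to do is take source data containing 29,268 records and create six different groups of data from it, each unique by email address (which is a field in the data). Below is my base query, which gets me 4,878 records. Conceptually this query would be run six times, but each run needs to return a new set of 4,878 records that is unique by email address, so that email addresses returned by one run never appear in any earlier run. I have some thoughts, but I am not sure how to proceed with what I need to do. I would classify myself as intermediate at SQL, and this one is bugging me. Any ideas?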

select top 1124 * from
Master_Subscribers_Score_GTE_5
where [E-mail Address] like '%YAHOO.COM%'

union all

select top 402 * from
Master_Subscribers_Score_GTE_5
where ([E-mail Address] like '%HOTMAIL.COM%' or [E-mail Address] like '%LIVE.COM%')

union all

select top 45 * from
Master_Subscribers_Score_GTE_5
where [E-mail Address] like '%AOL.COM%'

union all

select top 2353 * from
Master_Subscribers_Score_GTE_5
where [E-mail Address] like '%GMAIL.COM%'

union all

select top 164 * from
Master_Subscribers_Score_GTE_5
where ([E-mail Address] like '%ATT.COM%' or [E-mail Address] like '%SBCGLOBAL.NET%')

union all

select top 8 * from
Master_Subscribers_Score_GTE_5
where [E-mail Address] like '%COX.NET%'

union all

select top 3 * from
Master_Subscribers_Score_GTE_5
where [E-mail Address] like '%VERIZON.NET%'

union all

select top 70 * from
Master_Subscribers_Score_GTE_5
where [E-mail Address] like '%RR.COM%'

union all

select top 712 * from
Master_Subscribers_Score_GTE_5
where [E-mail Address] not like '%YAHOO.COM%' and
[E-mail Address] not like '%HOTMAIL.COM%' and
[E-mail Address] not like '%LIVE.COM%' and
[E-mail Address] not like '%AOL.COM%' and
[E-mail Address] not like '%GMAIL.COM%' and
[E-mail Address] not like '%ATT.COM%' and
[E-mail Address] not like '%SBCGLOBAL.NET%' and
[E-mail Address] not like '%COX.NET%' and
[E-mail Address] not like '%VERIZON.NET%' and
[E-mail Address] not like '%RR.COM%'

2 Answers:

Answer 0 (score: 1)

First, using LIKE has its drawbacks. Take a look at this post.
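One drawback that is easy to demonstrate (the linked post may focus on others) is that a pattern such as '%ATT.COM%' matches the text anywhere in the address. The sketch below uses hypothetical sample addresses, not rows from the question's table, and assumes a case-insensitive collation. A pattern with a leading wildcard also generally prevents an index seek on the column.

```sql
-- Hypothetical sample addresses for illustration only.
SELECT v.Email,
       -- Pattern from the question: matches "att.com" anywhere in the string.
       CASE WHEN v.Email LIKE '%ATT.COM%' THEN 1 ELSE 0 END AS MatchesAnywhere,
       -- Anchored pattern: the address must end with "@att.com".
       CASE WHEN v.Email LIKE '%@ATT.COM' THEN 1 ELSE 0 END AS MatchesDomain
FROM (VALUES
    ('jane@att.com'),          -- the intended match
    ('bob@wyatt.com'),         -- a different domain that contains "att.com"
    ('att.com.fan@gmail.com')  -- "att.com" appears before the @ sign
) AS v(Email);
```

Under a case-insensitive collation, all three rows satisfy the first pattern, while only jane@att.com satisfies the anchored one.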

You can use SUBSTRING and CHARINDEX to get the provider (host) part of the email address.

The following returns the email provider:

SUBSTRING(Email, CHARINDEX('@', Email, 1)+1, LEN(Email) - CHARINDEX('@', Email, 1))
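To see what this expression returns, here is a small sketch run against hypothetical sample values rather than the real table. The column name Email follows this answer; the question's table calls the column [E-mail Address].

```sql
-- Hypothetical sample values; replace the VALUES list with the real table.
SELECT v.Email,
       SUBSTRING(v.Email,
                 CHARINDEX('@', v.Email, 1) + 1,               -- start just after the @
                 LEN(v.Email) - CHARINDEX('@', v.Email, 1))    -- characters remaining after the @
           AS EmailDomain
FROM (VALUES
    ('jane@yahoo.com'),   -- EmailDomain: yahoo.com
    ('bob@Hotmail.com'),  -- EmailDomain: Hotmail.com (original case is preserved)
    ('no-at-sign')        -- CHARINDEX returns 0, so the whole string comes back
) AS v(Email);
```

Rows without an @ sign return the full value unchanged, so they would fall into whatever catch-all group the later filtering defines.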

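Now that you have the part you need to filter on, you can use it to filter the records, and then use ROW_NUMBER() to number the records for each provider; that number is then used for further filtering. You can use CASE to label the records.

Here is an example: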

SELECT *
FROM (
    SELECT *
    ,   CASE
            WHEN  UPPER(EmailDomain) = 'YAHOO.COM' AND RN <= 1124 
            THEN 'Group 1'
            WHEN  UPPER(EmailDomain) = 'HOTMAIL.COM' AND RN <= 402
            THEN 'Group 2'
            WHEN  UPPER(EmailDomain) = 'AOL.COM' AND RN <= 45
            THEN 'Group 3'
            WHEN  UPPER(EmailDomain) = 'GMAIL.COM' AND RN <= 2353
            THEN 'Group 4'
            WHEN  (UPPER(EmailDomain) = 'ATT.COM' OR UPPER(EmailDomain) = 'SBCGLOBAL.NET') AND RN <= 164
            THEN 'Group 5'
            WHEN  UPPER(EmailDomain) = 'COX.NET'  AND RN <= 8
            THEN 'Group 6'
            WHEN  UPPER(EmailDomain) = 'VERIZON.NET' AND RN <= 3
            THEN 'Group 7'
            WHEN  UPPER(EmailDomain) = 'RR.COM' AND RN <= 70
            THEN 'Group 8'
            WHEN  UPPER(EmailDomain) NOT IN('YAHOO.COM','HOTMAIL.COM','AOL.COM','GMAIL.COM','ATT.COM','SBCGLOBAL.NET','COX.NET','VERIZON.NET','RR.COM') AND RN <= 712
            THEN 'Group 9'
            ELSE NULL
        END EmailGroup
    FROM (
        -- Note: RN restarts for each distinct EmailDomain, so the limits above apply per domain, not per group.
        SELECT *, ROW_NUMBER() OVER(PARTITION BY EmailDomain ORDER BY EmailDomain) RN
        FROM (
            SELECT
                Email
            ,   SUBSTRING(Email, CHARINDEX('@', Email, 1)+1, LEN(Email) - CHARINDEX('@', Email, 1)) EmailDomain
            FROM
                Master_Subscribers_Score_GTE_5
        ) D
    ) C
) E
WHERE 
    EmailGroup IS NOT NULL 

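Note that I have used ROW_NUMBER() instead of SELECT TOP x. I then simply assign NULL to any record that does not fit one of the conditions, which gives me an easy way to show only what I need: the outer query filters out the NULL rows to exclude everything else.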
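A variation on the same idea is to assign each address to a bucket first and number rows per bucket, so that combined buckets (such as hotmail.com plus live.com) and the catch-all bucket share one counter. The sketch below makes several assumptions: the column is named Email as in this answer, @Run is a hypothetical variable selecting which of the six slices to return, the quotas are the TOP values from the question, and every bucket holds at least six times its quota. Domains are matched exactly, so a subdomain such as mail.yahoo.com lands in the catch-all bucket, unlike the LIKE queries in the question.

```sql
DECLARE @Run int = 1;  -- hypothetical: which of the six slices to return (1 to 6)

WITH Domains AS (
    SELECT Email,
           UPPER(SUBSTRING(Email, CHARINDEX('@', Email) + 1, LEN(Email))) AS EmailDomain
    FROM Master_Subscribers_Score_GTE_5
),
Bucketed AS (
    SELECT Email, EmailDomain,
           CASE
               WHEN EmailDomain = 'YAHOO.COM'                    THEN 1
               WHEN EmailDomain IN ('HOTMAIL.COM', 'LIVE.COM')   THEN 2
               WHEN EmailDomain = 'AOL.COM'                      THEN 3
               WHEN EmailDomain = 'GMAIL.COM'                    THEN 4
               WHEN EmailDomain IN ('ATT.COM', 'SBCGLOBAL.NET')  THEN 5
               WHEN EmailDomain = 'COX.NET'                      THEN 6
               WHEN EmailDomain = 'VERIZON.NET'                  THEN 7
               WHEN EmailDomain = 'RR.COM'                       THEN 8
               ELSE 9                                            -- everything else
           END AS Bucket
    FROM Domains
),
Numbered AS (
    -- One counter per bucket; ordering by Email keeps the numbering repeatable.
    SELECT *, ROW_NUMBER() OVER (PARTITION BY Bucket ORDER BY Email) AS RN
    FROM Bucketed
)
SELECT n.Email, n.EmailDomain, n.Bucket, n.RN
FROM Numbered AS n
JOIN (VALUES (1, 1124), (2, 402), (3, 45), (4, 2353), (5, 164),
             (6, 8), (7, 3), (8, 70), (9, 712)) AS q(Bucket, Quota)
  ON q.Bucket = n.Bucket
-- Slice @Run covers rows (@Run - 1) * Quota + 1 through @Run * Quota of each bucket.
WHERE n.RN >  (@Run - 1) * q.Quota
  AND n.RN <= @Run * q.Quota;
```

Because each run takes a different, non-overlapping range of RN values within every bucket, an address returned for one value of @Run cannot be returned for another, provided the table does not change between runs and each address appears only once in it.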

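I used UPPER() because I don't know your database collation, and so whether comparisons are case-sensitive or not. Using it sidesteps that question. If your database is case-insensitive, you don't need it.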
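If you want to check rather than guess, the following sketch shows one way to look up the database's default collation and, as an alternative to UPPER(), to force a case-insensitive comparison for a single predicate. The sample addresses are hypothetical, and Latin1_General_CI_AS is just one commonly available case-insensitive collation; a column can also carry its own collation that differs from the database default.

```sql
-- Default collation of the current database; "_CI_" in the name means case-insensitive.
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS DatabaseCollation;

-- Force a case-insensitive comparison for one predicate instead of using UPPER().
SELECT v.Email
FROM (VALUES ('jane@Yahoo.com'), ('bob@gmail.com')) AS v(Email)  -- hypothetical rows
WHERE v.Email COLLATE Latin1_General_CI_AS LIKE '%@yahoo.com';
```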

I hope this helps.

Answer 1 (score: 0)

with ranked as (
    select m.*, n = row_number() over (partition by b.bucket order by m.[E-mail Address])
    from Master_Subscribers_Score_GTE_5 m
    outer apply (select bucket from (values
        ('yahoo.com'), ('hotmail.com,live.com'),
        ('aol.com'), ('gmail.com'), ('att.com,sbcglobal.net'),
        ('cox.net'), ('verizon.net'), ('rr.com'))
        _(bucket) where exists (
            select * from string_split(bucket, ',')
            where m.[E-mail Address] like '%' + value + '%')) b)

select * from ranked where n % 6 = 0

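...should give you 1124 for yahoo.com, 402 for hotmail.com and live.com, and so on, since it takes every sixth row of each bucket. Then query where n % 6 = 1 for the next group, where n % 6 = 2 for the one after that, and so on.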
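Rather than running the query six times, the same ranking can label every row with its group in one pass. This is a sketch built on the query above, with GroupNo as a hypothetical column name. It assumes SQL Server 2016 or later (database compatibility level 130 or higher), which STRING_SPLIT requires. Note also that an address matching the patterns of more than one bucket would come back once per matching bucket.

```sql
WITH ranked AS (
    SELECT m.*,
           n = ROW_NUMBER() OVER (PARTITION BY b.bucket ORDER BY m.[E-mail Address])
    FROM Master_Subscribers_Score_GTE_5 AS m
    OUTER APPLY (
        -- Addresses matching no bucket get bucket = NULL and are ranked together.
        SELECT bucket
        FROM (VALUES
            ('yahoo.com'), ('hotmail.com,live.com'),
            ('aol.com'), ('gmail.com'), ('att.com,sbcglobal.net'),
            ('cox.net'), ('verizon.net'), ('rr.com')
        ) AS buckets(bucket)
        WHERE EXISTS (
            SELECT * FROM STRING_SPLIT(bucket, ',')
            WHERE m.[E-mail Address] LIKE '%' + value + '%')
    ) AS b
)
-- n % 6 is 0 to 5; adding 1 labels the groups 1 to 6.
SELECT r.*, GroupNo = r.n % 6 + 1
FROM ranked AS r;
```

Filtering on GroupNo = 1 returns the same rows as the n % 6 = 0 query, GroupNo = 2 the same rows as n % 6 = 1, and so on.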