我正在尝试创建一个报告,根据三个标准显示潜在的重复记录:SSN的最后4个,姓氏和DOB。我在这个问题上发布了一个问题here并收到了一个答案,我应该使用Cross Apply来取消数据的转移。查询运行速度很快,结果看起来比原始查询要好。
新查询如下,我添加了一个过滤器,以显示我看到的两个数据示例:
DECLARE
@StartDate DATE = '1/1/2017',
@EndDate DATE = '3/1/2017';
WITH
CTE
AS
(
SELECT
DENSE_RANK() OVER (ORDER BY c.socialSecurityNumber) AS [SSNRanking] ,
c.id AS [CustomerID] ,
c.socialSecurityNumber AS [SSN],
c.firstName AS [FirstName] ,
c.lastName AS [LastName] ,
c.birthDate [BirthDate] ,
c.emailAddress AS [EmailAddress],
c.createDate AS [CreateDate] ,
MAX(co.orderDate) AS [LastOrderDate] ,
ca.street1 AS [Addr1] ,
ca.city AS [City] ,
ca.stateAndTerritoriesID AS [State],
ca.zipCode5 AS [Zip] ,
c2.id AS [DupCustomerID] ,
c2.socialSecurityNumber AS [DupSSN] ,
c2.firstName AS [DupFirstName] ,
c2.lastName AS [DupLastName] ,
c2.birthDate AS [DupBirthDate] ,
c2.emailAddress AS [DupEmailAddress] ,
c2.createDate AS [DupCreateDate] ,
MAX(co.orderDate) AS [DupLastOrderDate] ,
ca.street1 AS [DupAddr1] ,
ca.city AS [DupCity],
ca.stateAndTerritoriesID AS [DupState] ,
ca.zipCode5 AS [DupZip]
FROM
dbo.Customers AS [c]
INNER JOIN dbo.Customers AS [c2] ON ( SUBSTRING(c.socialSecurityNumber,6,4) = SUBSTRING(c2.socialSecurityNumber,6,4) AND c.birthDate = c2.birthDate AND c.lastName = c2.lastName AND c.id <> c2.id )
INNER JOIN dbo.CustomerAddresses AS [ca] ON c.id = ca.customerID
--INNER JOIN dbo.CustomerAddresses AS [ca2] ON ca2.customerID = c2.id
LEFT OUTER JOIN dbo.Common_Orders AS [co] ON co.customerID = c.id
WHERE
c.customerStatusTypeID <> 'M'
AND c2.customerStatusTypeID <> 'M'
AND ca.addressType = 'M'
--AND ca2.addressType = 'M'
AND c.mergedTo IS NULL
AND c2.mergedTo IS NULL
AND CAST(co.orderDate AS DATE) >= @StartDate
AND CAST(co.orderDate AS DATE) <= @EndDate
AND ( c.id = 1545229 OR c.id = 2020489 )
GROUP BY
c.id ,
c.socialSecurityNumber ,
c.firstName ,
c.lastName ,
c.birthDate ,
c.emailAddress ,
c.createDate ,
ca.street1 ,
ca.city ,
ca.stateAndTerritoriesID ,
ca.zipCode5 ,
c2.id ,
c2.socialSecurityNumber ,
c2.firstName ,
c2.lastName ,
c2.birthDate ,
c2.emailAddress ,
c2.createDate ,
ca.street1 ,
ca.city ,
ca.stateAndTerritoriesID ,
ca.zipCode5
)
SELECT CA.CustomerID,
CA.SSNRanking ,
CA.SSN ,
CA.FirstName,
CA.LastName,
CA.BirthDate,
CA.EmailAddress,
CA.CreateDate ,
CA.LastOrderDate ,
CA.Addr1,
CA.City,
CA.[State],
CA.Zip
FROM
CTE
CROSS APPLY
(
VALUES
(CTE.SSNRanking, CTE.CustomerID, CTE.SSN, CTE.FirstName, CTE.LastName, CTE.Birthdate, CTE.EmailAddress, CTE.CreateDate, CTE.LastOrderDate, CTE.Addr1, CTE.City, CTE.[State], CTE.Zip),
(CTE.SSNRanking, CTE.DupCustomerID, CTE.DupSSN, CTE.DupFirstName, CTE.DuplastName, CTE.DupBirthDate, CTE.DupEmailAddress, CTE.DupCreateDate, CTE.DupLastOrderDate, CTE.DupAddr1, CTE.DupCity, CTE.DupState, CTE.DupZip)
) AS CA (SSNRanking, CustomerID, SSN, FirstName, LastName, BirthDate, EmailAddress, CreateDate, LastOrderDate, Addr1, City, [State], Zip)
ORDER BY CAST(CA.SSN AS INT) ASC, CA.CustomerID;
结果集看起来像(Original Image):
+ ---------- + ---------- + --------- + ---------- + -------- + ---------- + ------------ + --------------------- + ----------------------- + --------------------- + ------------ + ----- + ----- +
| CustomerId | SSNRanking | SSN | FirstName | LastName | BirthDate | EmailAddress | CreateDate | LastOrderDate | Addr1 | City | State | Zip |
+ ---------- + ---------- + --------- + ---------- + -------- + ---------- + ------------ + --------------------- + ----------------------- + --------------------- + ------------ + ----- + ----- +
| 1545229 | 1 | 000000000 | Aquia Boat | SALES | 1900-01-01 | null | 2013-05-28 00:00:00.0 | 2017-01-23 11:08:30.723 | 236 Willow Landing Rd | Stafford | VA | 22554 |
| 1545229 | 1 | 000000000 | Aquia Boat | SALES | 1900-01-01 | null | 2013-05-28 00:00:00.0 | 2017-01-06 12:31:15.370 | 11963 Jefferson Ave | Newport News | VA | 23606 |
| 2020489 | 1 | 000000000 | DIXIE | SALES | 1900-01-01 | null | 2017-01-06 12:27:56.5 | 2017-01-06 12:31:15.370 | 11963 Jefferson Ave | Newport News | VA | 23606 |
| 2020489 | 1 | 000000000 | DIXIE | SALES | 1900-01-01 | null | 2017-01-06 12:27:56.5 | 2017-01-23 11:08:30.723 | 236 Willow Landing Rd | Stafford | VA | 22554 |
+ ---------- + ---------- + --------- + ---------- + -------- + ---------- + ------------ + --------------------- + ----------------------- + --------------------- + ------------ + ----- + ----- +
我看到SSN的最后四个,姓氏和DOB都匹配。但后来我注意到重复的客户ID Aqua Boats有一个带有Dixie销售地址的条目,而Dixie Sales有一个Aqua Boats地址的条目 - 我不应该这样,我查看了特定账户的客户/客户地址表:
SELECT c.id, c.socialSecurityNumber, c.firstName, c.lastName, c.birthDate, ca.street1
FROM dbo.Customers AS c
INNER JOIN dbo.CustomerAddresses AS [ca] ON ca.customerID = c.id
WHERE c.id IN (1545229,2020489) AND ca.addressType = 'M';
结果在这里(Original Image):
+ ---------- + -------------------- + ---------- + -------- + ---------- + --------------------- +
| id | socialSecurityNumber | firstName | lastName | birthDate | street 1 |
+ ---------- + -------------------- + ---------- + -------- + ---------- + --------------------- +
| 1545229 | 000000000 | Aquia Boat | SALES | 1900-01-01 | 236 Willow Landing Rd |
| 2020489 | 000000000 | DIXIE | SALES | 1900-01-01 | 11963 Jefferson Ave |
+ ---------- + -------------------- + ---------- + -------- + ---------- + --------------------- +
当我在CTE中运行查询时,我添加了两个额外的过滤器:
AND c.id <> ca2.customerID
AND c2.id <> ca.customerID
数据集看起来像我想要的(Original Image):
+ ---------- + ---------- + --------- + ---------- + -------- + ---------- + --------------------- + ------------ + ----- + ----- + ------------- + --------- + ------------- + ----------- + ------------- + --------------------- + ------------ + -------- + ------ +
| SSNRanking | CustomerID | SSN | FirstName | LastName | BirthDate | Addr1 | City | State | Zip | DupCustomerID | DupSSN | DupFirstName | DupLastName | DupBirthDate | DupAddr1 | DupCity | DupState | DupZip |
+ ---------- + ---------- + --------- + ---------- + -------- + ---------- + --------------------- + ------------ + ----- + ----- + ------------- + --------- + ------------- + ----------- + ------------- + --------------------- + ------------ + -------- + ------ +
| 1 | 1545229 | 000000000 | Aquia Boat | SALES | 1900-01-01 | 236 Willow Landing Rd | Stafford | VA | 22554 | 2020589 | 000000000 | DIXIE | SALES | 1900-01-01 | 236 Willow Landing Rd | Stafford | VA | 22554 |
| 1 | 2020489 | 000000000 | DIXIE | SALES | 1900-01-01 | 11963 Jefferson Ave | Newport News | VA | 23606 | 1545229 | 000000000 | AQUIA BOAT | SALES | 1900-01-01 | 11963 Jefferson Ave | Newport News | VA | 23606 |
+ ---------- + ---------- + --------- + ---------- + -------- + ---------- + --------------------- + ------------ + ----- + ----- + ------------- + --------- + ------------- + ----------- + ------------- + --------------------- + ------------ + -------- + ------ +
我是否可以阻止CROSS APPLY为分配给其他客户的邮寄地址的客户创建其他记录?
谢谢,
答案 0 :(得分:1)
我认为你应该抛出样本数据(至少1到20行)。
或者首先让您的查询正确,然后尝试查找重复记录。
你是什么意思&#34; SSN的最后4个,姓氏和DOB。 &#34;你得到什么样的数据?
看起来很简单,
DECLARE @SampleData AS TABLE ( SSN int,lastName varchar(50),DOB datetime)
INSERT INTO @SampleData VALUES (50,'Smith','1980-02-02'),
(50,'Smith1','1980-02-02'),(50,'Smith','1980-02-02'),(50,'Smith','1980-02-02')
;With CTE as
(
select *,ROW_NUMBER()over(PARTITION by ssn,lastName,dob order by (select null))rn
from @SampleData
)
select * from cte
where rn>=4
答案 1 :(得分:0)
我一直在摆弄报告,我通过添加另一个CTE和ROW_NUMBER函数来删除我看到的重复行,从而实现了我想要的结果。对于我想要的结果来说似乎有点过分,但这比我原来的查询要快得多:
IF OBJECT_ID('tempdb..#DupSSN') IS NOT NULL
DROP TABLE #DupSSN;
WITH cte
AS
(
SELECT DENSE_RANK() OVER (ORDER BY c.socialSecurityNumber) AS [SSNRanking] ,
c.socialSecurityNumber AS [SSN],
RTRIM(LTRIM(SUBSTRING(c.socialSecurityNumber, 1, 3))) + '-' + RTRIM(LTRIM(SUBSTRING(c.socialSecurityNumber, 4, 2))) + '-' + RTRIM(LTRIM(SUBSTRING(c.socialSecurityNumber, 6, 4))) AS [F_SSN] ,
c.id AS [CustomerID] ,
REPLACE(LTRIM(RTRIM(c.firstName)), CHAR(13) + CHAR(10), ' ') AS [FirstName] ,
REPLACE(LTRIM(RTRIM(c.lastName)), CHAR(13) + CHAR(10), ' ') AS [LastName] ,
c.birthDate AS [BirthDate] ,
MAX(co.orderDate) AS [LastOrderDate] ,
RTRIM(LTRIM(ca.street1)) AS [Addr1] ,
RTRIM(LTRIM(ca.city)) AS [City] ,
ca.stateAndTerritoriesID AS [State] ,
ca.zipCode5 AS [Zip] ,
RTRIM(LTRIM(c.emailAddress)) AS [EmailAddress] ,
c.createDate AS [CreateDate] ,
c2.socialSecurityNumber AS [DupSSN] ,
RTRIM(LTRIM(SUBSTRING(C2.socialSecurityNumber, 1, 3))) + '-' + RTRIM(LTRIM(SUBSTRING(C2.socialSecurityNumber, 4, 2))) + '-' + RTRIM(LTRIM(SUBSTRING(C2.socialSecurityNumber, 6, 4))) AS [F_DupSSn] ,
c2.id AS [DupCustomerID] ,
REPLACE(LTRIM(RTRIM(c2.firstName)), CHAR(13) + CHAR(10), ' ') AS [DupFirstName] ,
REPLACE(LTRIM(RTRIM(c2.lastName)), CHAR(13) + CHAR(10), ' ') AS [DupLastName] ,
c2.birthDate AS [DupBirthDate] ,
MAX(co2.orderDate) AS [DupLastOrderDate] ,
RTRIM(LTRIM(ca2.street1)) AS [DupAddr1] ,
RTRIM(LTRIM(ca2.city)) AS [DupCity] ,
ca2.stateAndTerritoriesID AS [DupState] ,
ca2.zipCode5 AS [DupZip] ,
RTRIM(LTRIM(c2.emailAddress)) AS [DupEmailAddress] ,
c2.createDate AS [DupCreateDate]
FROM dbo.Customers AS [c]
INNER JOIN dbo.Customers AS [c2] ON ( SUBSTRING(c.socialSecurityNumber,6,4) = SUBSTRING(c2.socialSecurityNumber,6,4) AND c.birthDate = c2.birthDate AND c.lastName = c2.lastName AND c.id <> c2.id )
INNER JOIN dbo.CustomerAddresses AS [ca] ON c.id = ca.customerID
INNER JOIN dbo.CustomerAddresses AS [ca2] ON c2.id = ca2.customerID
INNER JOIN dbo.Common_Orders AS [co] ON co.customerID = c.id
INNER JOIN dbo.Common_Orders AS [co2] ON co2.customerID = c2.id
WHERE c.customerStatusTypeID <> 'M'
AND c2.customerStatusTypeID <> 'M'
AND ca.addressType <> 'M'
AND ca2.addressType <> 'M'
AND c.mergedTo IS NULL
AND c2.mergedTo IS NULL
AND CAST(co.orderDate AS DATE) >= '3/1/2017'
AND CAST(co.orderDate AS DATE) <= '3/31/2017'
GROUP BY c.socialSecurityNumber ,
RTRIM(LTRIM(SUBSTRING(c.socialSecurityNumber, 1, 3))) + '-' + RTRIM(LTRIM(SUBSTRING(c.socialSecurityNumber, 4, 2))) + '-' + RTRIM(LTRIM(SUBSTRING(c.socialSecurityNumber, 6, 4))) ,
c.id ,
REPLACE(LTRIM(RTRIM(c.firstName)), CHAR(13) + CHAR(10), ' ') ,
REPLACE(LTRIM(RTRIM(c.lastName)), CHAR(13) + CHAR(10), ' ') ,
c.birthDate ,
RTRIM(LTRIM(CA.street1)) ,
RTRIM(LTRIM(CA.city)) ,
CA.stateAndTerritoriesID ,
CA.zipCode5 ,
RTRIM(LTRIM(c.emailAddress)) ,
c.createDate ,
c2.socialSecurityNumber ,
RTRIM(LTRIM(SUBSTRING(C2.socialSecurityNumber, 1, 3))) + '-' + RTRIM(LTRIM(SUBSTRING(C2.socialSecurityNumber, 4, 2))) + '-' + RTRIM(LTRIM(SUBSTRING(C2.socialSecurityNumber, 6, 4))) ,
c2.id ,
REPLACE(LTRIM(RTRIM(c2.firstName)), CHAR(13) + CHAR(10), ' ') ,
REPLACE(LTRIM(RTRIM(c2.lastName)), CHAR(13) + CHAR(10), ' ') ,
c2.birthDate ,
RTRIM(LTRIM(ca2.street1)) ,
RTRIM(LTRIM(ca2.city)) ,
ca2.stateAndTerritoriesID ,
ca2.zipCode5 ,
RTRIM(LTRIM(c2.emailAddress)) ,
c2.createDate
)
-- Use a CTE to cross apply or unpivot the potential duplicates into separate rows
SELECT ca.SSNRanking ,
ca.SSN ,
ca.F_SSN ,
ca.CustomerID ,
ca.FirstName ,
ca.LastName ,
ca.BirthDate ,
ca.LastOrderDate ,
ca.Addr1 ,
ca.City ,
ca.[State] ,
ca.Zip ,
ca.EmailAddress ,
ca.CreateDate
INTO #DupSSN
FROM cte
CROSS APPLY
(
VALUES
( cte.SSNRanking, cte.SSN, cte.F_SSN, cte.CustomerID, cte.FirstName, cte.LastName, cte.BirthDate, cte.LastOrderDate, cte.Addr1, cte.City, cte.[State], cte.Zip, cte.EmailAddress, cte.CreateDate ) ,
( cte.SSNRanking, cte.DupSSN, cte.F_SSN, cte.DupCustomerID, cte.DupFirstName, cte.DupLastName, cte.DupBirthDate, cte.DupLastOrderDate,cte.DupAddr1, cte.DupCity, cte.DupState, cte.DupZip, cte.DupEmailAddress, cte.DupCreateDate )
) AS ca (SSNRanking, SSN, F_SSN, CustomerID, FirstName, LastName, BirthDate, LastOrderDate, Addr1, City, State, Zip, EmailAddress, CreateDate)
ORDER BY CAST(ca.SSN AS INT) ASC, ca.CustomerID;
-- Use CTE to create ROW_NUMBER function to eliminate duplicate records
WITH SSN_RN AS
(
SELECT SSNRanking ,
ROW_NUMBER() OVER ( PARTITION BY SSN, CustomerID, FirstName, LastName, BirthDate, LastOrderDate, Addr1, City, CreateDate ORDER BY ( SELECT NULL ) ) AS [RowNumber] ,
SSN ,
F_SSN ,
CustomerID ,
FirstName ,
LastName ,
BirthDate ,
LastOrderDate ,
Addr1 ,
City ,
State ,
Zip ,
EmailAddress ,
CreateDate
FROM #DupSSN
)
SELECT SSN_RN.SSNRanking ,
SSN_RN.RowNumber ,
SSN_RN.SSN ,
SSN_RN.F_SSN ,
SSN_RN.CustomerID ,
SSN_RN.FirstName ,
SSN_RN.LastName ,
SSN_RN.BirthDate ,
SSN_RN.LastOrderDate ,
SSN_RN.Addr1 ,
SSN_RN.City ,
SSN_RN.State ,
SSN_RN.Zip ,
SSN_RN.EmailAddress ,
SSN_RN.CreateDate
FROM SSN_RN
WHERE SSN_RN.RowNumber = 1
ORDER BY CAST(SSN_RN.SSN AS INT) ASC ,
SSN_RN.CustomerID;
DROP TABLE #DupSSN;