Remove duplicates from a file or table

Time: 2017-12-22 10:24:27

Tags: python sql-server sql-server-2008-r2 anaconda

I have data in a database table that I exported to a file like the one below. It holds about 1 million records, and rows are duplicated on id:

            id | dp_1 | pp_1 | Phone
            ---|------|------|-------
            1  | dp1  |      | phone1
            1  |      | pp1  | phone1
            2  | dp2  | pp2  | phone2
            2  |      |      | phone4
            3  | dp3  | pp3  | phone3
            3  | dp3  |      | phone3
            4  |      | pp4  |
            4  | dp4  |      |

I want the result to look like this:

            id | dp_1 | pp_1 | Phone
            ---|------|------|----------------
            1  | dp1  | pp1  | phone1 - phone1
            2  | dp2  | pp2  | phone2 - phone4
            3  | dp3  | pp3  | phone3
            4  | dp4  | pp4  |

I wrote this SQL:

WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY id, DP_1, PP_1, phone ORDER BY id DESC) AS [rn]
    FROM table1
)
SELECT * INTO #temp FROM cte WHERE [rn] = 1 ORDER BY id;

How can I achieve this in Python or with a SQL query? I am using Anaconda.

3 answers:

Answer 0: (score: 1)

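I don't understand why ids 1 and 3 follow different phone logic (one repeats the number, the other does not). This answer can either repeat the phone (as for id 1) or return DISTINCT values (as for id 3). You can switch between the two by uncommenting the GROUP BY.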

--Sample Data
WITH VTE AS (
    SELECT *
    FROM (VALUES (1,'dp1',NULL,'phone1'),
                 (1,NULL,'pp1','phone1'),
                 (2,'dp2','pp2','phone2'),
                 (2,NULL,NULL,'phone4'),
                 (3,'dp3','pp3','phone3'),
                 (3,'dp3',NULL,'phone3')) V(id, dp_1, pp_1, phone))
--And the answer
SELECT id,
       MAX(dp_1) AS dp_1,
       MAX(pp_1) AS pp_1,
       STUFF((SELECT ' - ' + sq.phone 
              FROM VTE sq
              WHERE sq.id = VTE.id
                AND phone <> ''
              --GROUP BY sq.phone --If you only want to display unique phones, uncomment the GROUP BY.
              FOR XML PATH('')),1,3,'') AS [phone]
FROM VTE
GROUP BY id;
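
For readers who want to check the grouping logic outside the database, the query above can be sketched in plain Python. This is only an illustration under assumptions: the `merge_rows` helper and its `unique_phones` flag are hypothetical names (the flag plays the role of the commented-out GROUP BY), `None` stands in for NULL, and the two id 4 rows are taken from the question rather than from this answer's sample data.

```python
# Sample rows from the question: (id, dp_1, pp_1, phone); None stands for NULL.
ROWS = [
    (1, "dp1", None, "phone1"),
    (1, None, "pp1", "phone1"),
    (2, "dp2", "pp2", "phone2"),
    (2, None, None, "phone4"),
    (3, "dp3", "pp3", "phone3"),
    (3, "dp3", None, "phone3"),
    (4, None, "pp4", None),
    (4, "dp4", None, None),
]


def merge_rows(rows, unique_phones=False):
    """Collapse all rows that share an id into one (id, dp_1, pp_1, phone) tuple."""
    groups = {}
    for row_id, dp, pp, phone in rows:
        group = groups.setdefault(row_id, {"dp": [], "pp": [], "phones": []})
        if dp is not None:
            group["dp"].append(dp)
        if pp is not None:
            group["pp"].append(pp)
        if phone:  # mirrors `phone <> ''`, which also filters out NULL
            group["phones"].append(phone)

    merged = []
    for row_id in sorted(groups):
        group = groups[row_id]
        phones = group["phones"]
        if unique_phones:
            # De-duplicate while keeping first-seen order.
            phones = list(dict.fromkeys(phones))
        merged.append((
            row_id,
            max(group["dp"], default=None),  # like MAX(), ignores NULLs
            max(group["pp"], default=None),
            " - ".join(phones),  # empty string when no phone exists (SQL would give NULL)
        ))
    return merged


for row in merge_rows(ROWS, unique_phones=True):
    print(row)
```

With `unique_phones=False`, id 1 yields `phone1 - phone1`; with `unique_phones=True` it yields `phone1`, which are the two behaviours described above.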

Answer 1: (score: 0)

This query gives your expected result:

;With cte( id,dp_1,pp_1,Phone)
AS
(            
 SELECT 1 ,  'dp1' , NULL   , 'phone1'   UNION ALL
 SELECT 1 ,   NULL , 'pp1'  , 'phone1'   UNION ALL
 SELECT 2 ,  'dp2' , 'pp2'  , 'phone2'   UNION ALL
 SELECT 2 ,   NULL ,  NULL  , 'phone4'   UNION ALL
 SELECT 3 ,  'dp3' , 'pp3'  , 'phone3'   UNION ALL
 SELECT 3 ,  'dp3' ,  NULL  , 'phone3'   
 )
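 SELECT
     DISTINCT id,
     -- No ORDER BY inside OVER: SQL Server 2008 R2 rejects it for aggregate window functions
     MAX(dp_1) OVER (PARTITION BY id) AS dp_1,
     MAX(pp_1) OVER (PARTITION BY id) AS pp_1,
     -- The ' - ' separator is three characters long, so STUFF strips three
     STUFF((SELECT DISTINCT ' - ' + Phone FROM cte i WHERE i.id = o.id
            FOR XML PATH ('')), 1, 3, '') AS Phone
FROM cte o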

Result

id  dp_1     pp_1    Phone
--------------------------------
1   dp1      pp1     phone1 
2   dp2      pp2     phone2 - phone4
3   dp3      pp3     phone3 

Answer 2: (score: 0)

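In Python, your best option is pandas. I also use numpy to pick the unique values of 'Phone' in your case.

First, I simply create your table (I assume reading it from SQL is a separate problem):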

import numpy as np
import pandas as pd

df = pd.DataFrame(data={'id': [1, 1, 2, 2, 3, 3],
                        'dp_1': ['dp1', np.nan, 'dp2', np.nan, 'dp3', 'dp3'],
                        'pp_1': [np.nan, 'pp1', 'pp2', np.nan, 'pp3', np.nan],
                        'Phone': ['phone1 ', 'phone1 ', 'phone2 ', 'phone4 ', 'phone2 ', 'phone3 ']})

Then I create a function that will be applied to each group:

def unique_sum(str_list):
    return np.sum(np.unique(str_list))

Then apply the groupby. I hope this is what you need:

df.groupby('id').aggregate({'dp_1': 'last', 'pp_1': 'last', 'Phone': unique_sum})


    pp_1           Phone dp_1
id                          
1   pp1         phone1   dp1
2   pp2  phone2 phone4   dp2
3   pp3  phone2 phone3   dp3
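
The output above concatenates phones without the ' - ' separator the question asks for, and the sample omits the id 4 rows. A possible variant, offered as a sketch rather than a definitive solution: the `join_unique` helper is a hypothetical name, the data is retyped from the question's table (without trailing spaces), and missing values are assumed to arrive as None/NaN.

```python
import pandas as pd

# Rows retyped from the question's table; None marks an empty cell.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 3, 4, 4],
    "dp_1": ["dp1", None, "dp2", None, "dp3", "dp3", None, "dp4"],
    "pp_1": [None, "pp1", "pp2", None, "pp3", None, "pp4", None],
    "Phone": ["phone1", "phone1", "phone2", "phone4",
              "phone3", "phone3", None, None],
})


def join_unique(phones):
    """Drop missing phones, keep first-seen order, join with ' - '."""
    return " - ".join(dict.fromkeys(phones.dropna()))


# 'first' skips missing values, so each id gets its first non-null dp_1/pp_1.
result = df.groupby("id").agg(
    {"dp_1": "first", "pp_1": "first", "Phone": join_unique}
)
print(result)
```

For the real export, the DataFrame would presumably come from something like `pd.read_csv` on the exported file; the exact file layout is not given in the question, so that step is left out here.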