Remove duplicate rows from a CSV by comparing only two columns, using Python

Date: 2015-09-25 22:45:12

Tags: python database parsing csv python-3.x

There are probably many ways to approach this, but here is the gist:

I have two databases of people, both exported to csv files. One of the databases is being decommissioned. I need to compare the two csv files (or a combined version of both) and filter out every non-unique person from the decommissioned database, so that I import only the people unique to the decommissioned database into the current one.

I only need to compare FirstName and LastName (which are two separate columns). Part of the problem is that the rows are not exact duplicates: the names are in all caps in one database and in mixed case in the other.

Here is an example of the two csv files combined into one. The ALL-CAPS names come from the current database (that is how its csv is currently formatted):

FirstName,LastName,id,id2,id3
John,Doe,123,432,645
Jacob,Smith,456,372,383
Susy,Saucy,9999,12,8r83
Contractor ,#1,8dh,28j,153s
Testing2,Contrator,7463,99999,0283
JOHN,DOE,999,888,999
SUSY,SAUCY,8373,08j,9023

should be parsed down to:

Jacob,Smith,456,372,383
Contractor,#1,8dh,28j,153s
Testing2,Contrator,7463,99999,0283

Parsing the other columns doesn't matter in itself, but the data is obviously very relevant, so it must remain untouched. (There are actually dozens of other columns, not just three.)

To get an idea of how many duplicates I actually have, I ran this script (taken from a previous post):

with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen: continue # skip duplicate

        seen.add(line)
        out_file.write(line)

It was a bit too simple for my needs, though.
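A minimal sketch of how that script's idea could be adapted to count duplicates on just the lowercased name columns (the rows below are a hand-split subset of the sample data above):

```python
from collections import Counter

# A subset of the combined sample data above, as already-split rows.
rows = [
    ['John', 'Doe', '123', '432', '645'],
    ['Jacob', 'Smith', '456', '372', '383'],
    ['JOHN', 'DOE', '999', '888', '999'],
]

# Key each row by its lowercased (FirstName, LastName) pair,
# so differently-cased names count as the same person.
counts = Counter((first.lower(), last.lower()) for first, last, *rest in rows)
duplicates = {name: n for name, n in counts.items() if n > 1}
print(duplicates)  # {('john', 'doe'): 2}
```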

5 answers:

Answer 0 (score: 1)

You can use the pandas package for this.

Replace StringIO with the paths to your csv files.

import pandas as pd
from io import StringIO

df1 = pd.read_table(StringIO('''FirstName    LastName    id     id2    id3
John         Doe         123    432    645
Jacob        Smith       456    372    383
Susy         Saucy       9999   12     8r83
Contractor   #1          8dh    28j    153s
Testing2     Contrator   7463   99999  0283'''), delim_whitespace=True)

df2 = pd.read_table(StringIO('''FirstName    LastName    id     id2    id3
JOHN         DOE         999    888    999
SUSY         SAUCY       8373   08j    9023'''), delim_whitespace=True)

Concatenate and uppercase the names:

df1['name'] = (df1.FirstName + df1.LastName).str.upper()
df2['name'] = (df2.FirstName + df2.LastName).str.upper()

Then select the rows of df1 whose names do not match any name in df2.
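The answer stops short of that final selection. Assuming `Series.isin` is what was intended, the filtering step might be sketched like this (with small inline frames standing in for the real data):

```python
import pandas as pd

df1 = pd.DataFrame({'FirstName': ['John', 'Jacob'],
                    'LastName':  ['Doe', 'Smith'],
                    'id':        [123, 456]})
df2 = pd.DataFrame({'FirstName': ['JOHN'],
                    'LastName':  ['DOE'],
                    'id':        [999]})

# Build the case-insensitive key column, as above
df1['name'] = (df1.FirstName + df1.LastName).str.upper()
df2['name'] = (df2.FirstName + df2.LastName).str.upper()

# Keep only the rows of df1 whose name never occurs in df2,
# then drop the helper column again
unique = df1[~df1.name.isin(df2.name)].drop(columns='name')
print(unique)
```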

Answer 1 (score: 1)

Using a set is no use unless you actually want to keep one copy of each duplicated row. To keep only the truly unique rows, you need the counts of every name pair across the whole file; a Counter dict will do:

with open("test.csv", encoding="utf-8") as f, open("file_out.csv", "w") as out:
    from collections import Counter
    from csv import reader, writer
    wr = writer(out)
    header = next(f)  # consume the header line
    # count each first/last name pair, lowercasing both strings
    counts = Counter((a.lower(), b.lower()) for a, b, *_ in reader(f))
    f.seek(0)  # reset the file pointer to the start
    out.write(next(f))  # write the header
    # iterate over the file again, keeping only rows whose
    # first/last name pair appears exactly once
    wr.writerows(row for row in reader(f)
                 if counts[row[0].lower(), row[1].lower()] == 1)

Input:

FirstName,LastName,id,id2,id3
John,Doe,123,432,645
Jacob,Smith,456,372,383
Susy,Saucy,9999,12,8r83
Contractor,#1,8dh,28j,153s
Testing2,Contrator,7463,99999,0283
JOHN,DOE,999,888,999
SUSY,SAUCY,8373,08j,9023

file_out:

FirstName,LastName,id,id2,id3
Jacob,Smith,456,372,383
Contractor,#1,8dh,28j,153s
Testing2,Contrator,7463,99999,0283

counts tallies how many times each name pair occurs after lowercasing. We then reset the file pointer and write only the rows whose first two column values were seen exactly once in the whole file.

Or without the csv module, which might be faster if you have many columns:

with open("test.csv") as f, open("file_out.csv", "w") as out:
    from collections import Counter
    header = next(f)  # consume the header line
    # count the lowercased (first, last) pair of every row; the key
    # must be a tuple, since a bare map object is not hashable
    counts = Counter(tuple(map(str.lower, line.split(",", 2)[:2])) for line in f)
    f.seek(0)  # back to the start of the file
    next(f)  # skip the header this time
    out.write(header)  # write the original header
    out.writelines(line for line in f
                   if counts[tuple(map(str.lower, line.split(",", 2)[:2]))] == 1)

Answer 2 (score: 0)

You can keep the idea of using sets. Just define a function that returns the part of the line you are interested in:

def name(line):
    line = line.split(',')
    n = ' '.join(line[:2])
    return n.lower()

Instead of concatenating the two databases, read the names from the current database into a set:

with open('current.csv') as f:
    next(f)
    current_db = {name(line) for line in f}

Then check the names from the decommissioned database against that set, writing out the ones that haven't been seen:

with open('decommissioned.csv') as old, open('unique.csv', 'w') as out:
    next(old)
    for line in old:
        if name(line) not in current_db:
            out.write(line)

Answer 3 (score: -1)

You need to compare the names in a case-insensitive way. For example:

with open('1.csv', 'r') as in_file, open('2.csv', 'w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        # the data is comma-separated, so split on ','
        field_list = line.split(',')
        key_name = (field_list[0] + "_" + field_list[1]).lower()
        if key_name in seen: continue  # skip duplicate

        seen.add(key_name)
        out_file.write(line)

Answer 4 (score: -1)

Changed, since the data is in csv format:

from collections import defaultdict
import re

dd = defaultdict(list)
d = {}

with open("data") as f:
    for line in f:
        line = line.strip().lower()
        mobj = re.match(r'(\w+),(\w+|#\d),(.*)', line)
        firstf, secondf, rest = mobj.groups()
        key = firstf + "_" + secondf
        d[key] = rest
        dd[key].append(rest)

for k, v in d.items():
    print(k, v)

Output:

jacob_smith 456,372,383

testing2_contrator 7463,99999,0283

john_doe 999,888,999

susy_saucy 8373,08j,9023

contractor_#1 8dh,28j,153s

And the output of:

for k, v in dd.items():
    print(k,v)


jacob_smith ['456,372,383']

testing2_contrator ['7463,99999,0283']

john_doe ['123,432,645', '999,888,999']

susy_saucy ['9999,12,8r83', '8373,08j,9023']

contractor_#1 ['8dh,28j,153s']