There are probably many ways to solve this, but when it comes down to it, here is the gist:
I have two databases of people, both exported to csv files. One of the databases is being decommissioned. I need to compare the two csv files (or a combined version of both) and filter out every non-unique person from the database being decommissioned. That way I can import only the unique people from the decommissioned database into the current one.
I only need to compare FirstName and LastName (which are two separate columns). Part of the problem is that they are not exact duplicates: the names are all caps in one database and mixed case in the other.
Here is an example of the two csv files combined into one dataset. All of the ALL-CAPS names come from the current database (that is how its csv is currently formatted):
FirstName,LastName,id,id2,id3
John,Doe,123,432,645
Jacob,Smith,456,372,383
Susy,Saucy,9999,12,8r83
Contractor ,#1,8dh,28j,153s
Testing2,Contrator,7463,99999,0283
JOHN,DOE,999,888,999
SUSY,SAUCY,8373,08j,9023
Would be parsed down to:
Jacob,Smith,456,372,383
Contractor,#1,8dh,28j,153s
Testing2,Contrator,7463,99999,0283
Parsing out the other columns doesn't matter, but the data is obviously all related, so it must stay intact. (There are actually dozens of other columns, not just three.)
To get an idea of how many duplicates I actually have, I ran this script (taken from a previous post):
with open('1.csv', 'r') as in_file, open('2.csv', 'w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen: continue  # skip duplicate
        seen.add(line)
        out_file.write(line)
That is too simplistic for my needs, though.
Answer 0 (score: 1)
You can use the pandas package for this.
Replace the StringIO with the paths to your csv files.
import pandas as pd
from io import StringIO

df1 = pd.read_table(StringIO('''FirstName LastName id id2 id3
John Doe 123 432 645
Jacob Smith 456 372 383
Susy Saucy 9999 12 8r83
Contractor #1 8dh 28j 153s
Testing2 Contrator 7463 99999 0283'''), delim_whitespace=True)

df2 = pd.read_table(StringIO('''FirstName LastName id id2 id3
JOHN DOE 999 888 999
SUSY SAUCY 8373 08j 9023'''), delim_whitespace=True)

Concatenate and uppercase the names:

df1['name'] = (df1.FirstName + df1.LastName).str.upper()
df2['name'] = (df2.FirstName + df2.LastName).str.upper()

Then select the rows from df1 whose name does not appear in df2.
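A minimal sketch of that selection step using pandas' `Series.isin`, with small stand-in DataFrames rather than the real files:

```python
import pandas as pd

# stand-in data: one overlapping person, one unique person
df1 = pd.DataFrame({'FirstName': ['John', 'Jacob'], 'LastName': ['Doe', 'Smith']})
df2 = pd.DataFrame({'FirstName': ['JOHN'], 'LastName': ['DOE']})

# build the case-insensitive comparison key on both frames
df1['name'] = (df1.FirstName + df1.LastName).str.upper()
df2['name'] = (df2.FirstName + df2.LastName).str.upper()

# keep only the df1 rows whose name is absent from df2
unique = df1[~df1['name'].isin(df2['name'])]
print(unique.FirstName.tolist())  # ['Jacob']
```

All other columns of the kept rows ride along untouched, which matters since the real files have dozens of columns.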
Answer 1 (score: 1)
Using a set is no use here unless you want to keep one row per duplicated value. To keep only the rows that are unique, you first need a count of every first/last name pair across the whole file; a Counter dict will do:
from collections import Counter
from csv import reader, writer

with open("test.csv", encoding="utf-8") as f, open("file_out.csv", "w") as out:
    wr = writer(out)
    next(f)  # skip the header line
    # get count of each first/last name pair, lowercasing each string
    counts = Counter((a.lower(), b.lower()) for a, b, *_ in reader(f))
    f.seek(0)  # back to the start of the file
    out.write(next(f))  # write the header unchanged
    # iterate over the file again, only keeping rows which have
    # unique first and last names
    wr.writerows(row for row in reader(f)
                 if counts[row[0].lower(), row[1].lower()] == 1)
Input:
FirstName,LastName,id,id2,id3
John,Doe,123,432,645
Jacob,Smith,456,372,383
Susy,Saucy,9999,12,8r83
Contractor,#1,8dh,28j,153s
Testing2,Contrator,7463,99999,0283
JOHN,DOE,999,888,999
SUSY,SAUCY,8373,08j,9023
file_out:
FirstName,LastName,id,id2,id3
Jacob,Smith,456,372,383
Contractor,#1,8dh,28j,153s
Testing2,Contrator,7463,99999,0283
counts counts how many times each name pair appears after lowercasing. Then we reset the file pointer and only write rows whose first two column values are seen exactly once in the whole file.
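Seen in isolation, the counting step works like this (hypothetical in-memory name pairs rather than the real csv):

```python
from collections import Counter

rows = [("John", "Doe"), ("JOHN", "DOE"), ("Jacob", "Smith")]
# lowercase both fields so the count is case-insensitive
counts = Counter((first.lower(), last.lower()) for first, last in rows)

print(counts[("john", "doe")])     # 2 -> duplicated across databases, drop it
print(counts[("jacob", "smith")])  # 1 -> unique, keep it
```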
Or without the csv module, which may be faster if you have many columns:
from collections import Counter

with open("test.csv") as f, open("file_out.csv", "w") as out:
    header = next(f)  # consume the header line
    # count each lowercased first/last name pair; a tuple is needed
    # because a bare map object is not hashable as a dict key
    counts = Counter(tuple(map(str.lower, line.split(",", 2)[:2])) for line in f)
    f.seek(0)  # back to the start of the file
    next(f)  # skip the header again
    out.write(header)  # write the original header
    out.writelines(line for line in f
                   if counts[tuple(map(str.lower, line.split(",", 2)[:2]))] == 1)
Answer 2 (score: 0)
You can keep your idea of using sets. Just define a function that returns the part of the line you are interested in:
def name(line):
    line = line.split(',')
    n = ' '.join(line[:2])
    return n.lower()
Instead of concatenating the two databases, read the names from the current database into a set:
with open('current.csv') as f:
    next(f)
    current_db = {name(line) for line in f}
Then check the names in the decommissioned database, and write out the lines that have not been seen:
with open('decommissioned.csv') as old, open('unique.csv', 'w') as out:
    next(old)
    for line in old:
        if name(line) not in current_db:
            out.write(line)
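Taken together, the pieces above can be exercised end to end on in-memory data; a sketch using `io.StringIO` as a stand-in for the real files:

```python
import io

def name(line):
    # lowercase "first last" key built from the first two csv columns
    return ' '.join(line.split(',')[:2]).lower()

current = io.StringIO('FirstName,LastName,id\nJOHN,DOE,999\nSUSY,SAUCY,8373\n')
old = io.StringIO('FirstName,LastName,id\nJohn,Doe,123\nJacob,Smith,456\n')
out = io.StringIO()

next(current)  # skip header
current_db = {name(line) for line in current}

next(old)  # skip header
for line in old:
    if name(line) not in current_db:
        out.write(line)

print(out.getvalue())  # Jacob,Smith,456
```

John Doe is dropped despite the case difference, and the unique Jacob Smith row survives with its other columns intact.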
Answer 3 (score: -1)
You need to compare the names in a case-insensitive way. For example:
with open('1.csv', 'r') as in_file, open('2.csv', 'w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        field_list = line.split(',')  # the data is comma-separated
        key_name = (field_list[0] + "_" + field_list[1]).lower()
        if key_name in seen: continue  # skip duplicate
        seen.add(key_name)
        out_file.write(line)
Answer 4 (score: -1)
Changed, since the data is in csv format:
from collections import defaultdict
import re

dd = defaultdict(list)
d = {}
with open("data") as f:
    next(f)  # skip the header line
    for line in f:
        line = line.strip().lower()
        mobj = re.match(r'(\w+),(\w+|#\d),(.*)', line)
        firstf, secondf, rest = mobj.groups()
        key = firstf + "_" + secondf
        d[key] = rest  # keeps only the last occurrence of each name
        dd[key].append(rest)

for k, v in d.items():
    print(k, v)
Output:
jacob_smith 456,372,383
testing2_contrator 7463,99999,0283
john_doe 999,888,999
susy_saucy 8373,08j,9023
contractor_#1 8dh,28j,153s
For the defaultdict version:

for k, v in dd.items():
    print(k, v)

Output:
jacob_smith ['456,372,383']
testing2_contrator ['7463,99999,0283']
john_doe ['123,432,645', '999,888,999']
susy_saucy ['9999,12,8r83', '8373,08j,9023']
contractor_#1 ['8dh,28j,153s']
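With the defaultdict version, the unique people are simply the keys whose list has exactly one entry; a hypothetical follow-up using values shaped like the sample data:

```python
from collections import defaultdict

dd = defaultdict(list)
# john_doe appears in both databases, jacob_smith only once
for key, rest in [('john_doe', '123,432,645'),
                  ('john_doe', '999,888,999'),
                  ('jacob_smith', '456,372,383')]:
    dd[key].append(rest)

# keep only names seen exactly once, along with their remaining columns
unique = {k: v[0] for k, v in dd.items() if len(v) == 1}
print(unique)  # {'jacob_smith': '456,372,383'}
```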