My input file looks like this:
RefID|FirstName|MiddleName|LastName|SSN|DOB|School Year|Age|District LEA|District Description|School LEA|Location Description|title|frng_amt
1|JULIE|A|ADAMS|123456789|654321|20142015|47|0101000|DEWITTSCHOOLDISTRICT|P|014
2|JULIE|A|ADAMS|123456789|654321|20132014|46|0101000|DEWITTSCHOOLDISTRICT|S|13100
3|JULIE|A|ADAMS|123456789|654321|20122013|45|0101000|DEWITTSCHOOLDISTRICT|P|014
4|JULIE|A|ADAMS|123456789|654321|20132014|46|0101000|DEWITTSCHOOLDISTRICT|P|014
5|JULIE|A|ADAMS|123456789|654321|20142015|47|0101000|DEWITTSCHOOLDISTRICT|S|15000
6|JULIE|A|ADAMS|123456789|654321|20122013|45|0101000|DEWITTSCHOOLDISTRICT|S|13100
7|SHIRLEY||ADAMS|987654321|987890|20122013|49|0101000|DEWITTSCHOOLDISTRICT|S|13100
8|SHIRLEY||ADAMS|987654321|987890|20092010|46|0101000|DEWITTSCHOOLDISTRICT|P|014
9|SHIRLEY||ADAMS|987654321|987890|20102011|47|0101000|DEWITTSCHOOLDISTRICT|P|014
10|SHIRLEY||ADAMS|987654321|987890|20132014|50|0101000|DEWITTSCHOOLDISTRICT|S|13100
11|SHIRLEY||ADAMS|987654321|987890|20132014|50|0101000|DEWITTSCHOOLDISTRICT|P|014
12|SHIRLEY||ADAMS|987654321|987890|20122013|49|0101000|DEWITTSCHOOLDISTRICT|P|014
13|SHIRLEY||ADAMS|987654321|987890|20102011|47|0101000|DEWITTSCHOOLDISTRICT|A|13100
14|SHIRLEY||ADAMS|987654321|987890|20142015|51|0101000|DEWITTSCHOOLDISTRICT|S|15000
15|SHIRLEY||ADAMS|987654321|987890|20092010|46|0101000|DEWITTSCHOOLDISTRICT|A|13100
16|SHIRLEY||ADAMS|987654321|987890|20142015|51|0101000|DEWITTSCHOOLDISTRICT|P|014
I want to do data matching: in the output, I want to assign one unique ID to Julie and a different unique ID to Shirley, based on their SSN. So my desired output would be:
ID|RefID|FirstName|MiddleName|LastName|SSN|DOB|School Year|Age|District LEA|District Description|School LEA|Location Description|title|frng_amt
10001|1|JULIE|A|ADAMS|123456789|654321|20142015|47|0101000|DEWITTSCHOOLDISTRICT|P|014
10001|2|JULIE|A|ADAMS|123456789|654321|20132014|46|0101000|DEWITTSCHOOLDISTRICT|S|13100
10001|3|JULIE|A|ADAMS|123456789|654321|20122013|45|0101000|DEWITTSCHOOLDISTRICT|P|014
10001|4|JULIE|A|ADAMS|123456789|654321|20132014|46|0101000|DEWITTSCHOOLDISTRICT|P|014
10001|5|JULIE|A|ADAMS|123456789|654321|20142015|47|0101000|DEWITTSCHOOLDISTRICT|S|15000
10001|6|JULIE|A|ADAMS|123456789|654321|20122013|45|0101000|DEWITTSCHOOLDISTRICT|S|13100
10002|7|SHIRLEY||ADAMS|987654321|987890|20122013|49|0101000|DEWITTSCHOOLDISTRICT|S|13100
10002|8|SHIRLEY||ADAMS|987654321|987890|20092010|46|0101000|DEWITTSCHOOLDISTRICT|P|014
10002|9|SHIRLEY||ADAMS|987654321|987890|20102011|47|0101000|DEWITTSCHOOLDISTRICT|P|014
10002|10|SHIRLEY||ADAMS|987654321|987890|20132014|50|0101000|DEWITTSCHOOLDISTRICT|S|13100
10002|11|SHIRLEY||ADAMS|987654321|987890|20132014|50|0101000|DEWITTSCHOOLDISTRICT|P|014
10002|12|SHIRLEY||ADAMS|987654321|987890|20122013|49|0101000|DEWITTSCHOOLDISTRICT|P|014
10002|13|SHIRLEY||ADAMS|987654321|987890|20102011|47|0101000|DEWITTSCHOOLDISTRICT|A|13100
10002|14|SHIRLEY||ADAMS|987654321|987890|20142015|51|0101000|DEWITTSCHOOLDISTRICT|S|15000
10002|15|SHIRLEY||ADAMS|987654321|987890|20092010|46|0101000|DEWITTSCHOOLDISTRICT|A|13100
10002|16|SHIRLEY||ADAMS|987654321|987890|20142015|51|0101000|DEWITTSCHOOLDISTRICT|P|014
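How can I do this in Python? I'm trying to loop over a copy of the input file with an if check, but I feel this is a very inefficient and wrong way to implement it. Can someone help me figure out a way?
My current code: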
inputReader = open(inputFile,'r')
inputReaderCopy = open(inputFile, 'r')
outputWriter = open(outputFile, 'w')
counter = 100000
headers = inputReader.readline()
for x in inputReader:
    for y in inputReaderCopy:
        if x.split("|")[4] == y.split("|")[4]:
            outputWriter.write(str(counter) + "|" +y)
            counter+=1
        else:
            outputWriter.write("no match|"+ y)
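Answer 0 (score: 2)
Just keep a dict that maps each SSN you have seen to its unique ID. You only need a single pass over the lines, and the csv module will do the splitting for you. If you want a brand-new file: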
import csv
cn = 10001
with open("test.txt") as f, open("out.txt","w") as tmp:
    r, wr = csv.reader(f, delimiter="|"), csv.writer(tmp, delimiter="|")
    head, d = next(r), {}
    wr.writerow(["ID"]+head)
    for row in r:
        v = row[4]
        # if we have already seen the SSN, use the id assigned
        if v in d:
            wr.writerow([d[v]] + row)
        else:
            # else create new id, add pairing to dict and write
            d[v] = cn
            wr.writerow([cn] + row)
            cn += 1
Output:
ID|RefID|FirstName|MiddleName|LastName|SSN|DOB|School Year|Age|District LEA|District Description|School LEA|Location Description|title|frng_amt
10001|1|JULIE|A|ADAMS|123456789|654321|20142015|47|0101000|DEWITTSCHOOLDISTRICT|P|014
10001|2|JULIE|A|ADAMS|123456789|654321|20132014|46|0101000|DEWITTSCHOOLDISTRICT|S|13100
10001|3|JULIE|A|ADAMS|123456789|654321|20122013|45|0101000|DEWITTSCHOOLDISTRICT|P|014
10001|4|JULIE|A|ADAMS|123456789|654321|20132014|46|0101000|DEWITTSCHOOLDISTRICT|P|014
10001|5|JULIE|A|ADAMS|123456789|654321|20142015|47|0101000|DEWITTSCHOOLDISTRICT|S|15000
10001|6|JULIE|A|ADAMS|123456789|654321|20122013|45|0101000|DEWITTSCHOOLDISTRICT|S|13100
10002|7|SHIRLEY||ADAMS|987654321|987890|20122013|49|0101000|DEWITTSCHOOLDISTRICT|S|13100
10002|8|SHIRLEY||ADAMS|987654321|987890|20092010|46|0101000|DEWITTSCHOOLDISTRICT|P|014
10002|9|SHIRLEY||ADAMS|987654321|987890|20102011|47|0101000|DEWITTSCHOOLDISTRICT|P|014
10002|10|SHIRLEY||ADAMS|987654321|987890|20132014|50|0101000|DEWITTSCHOOLDISTRICT|S|13100
10002|11|SHIRLEY||ADAMS|987654321|987890|20132014|50|0101000|DEWITTSCHOOLDISTRICT|P|014
10002|12|SHIRLEY||ADAMS|987654321|987890|20122013|49|0101000|DEWITTSCHOOLDISTRICT|P|014
10002|13|SHIRLEY||ADAMS|987654321|987890|20102011|47|0101000|DEWITTSCHOOLDISTRICT|A|13100
10002|14|SHIRLEY||ADAMS|987654321|987890|20142015|51|0101000|DEWITTSCHOOLDISTRICT|S|15000
10002|15|SHIRLEY||ADAMS|987654321|987890|20092010|46|0101000|DEWITTSCHOOLDISTRICT|A|13100
10002|16|SHIRLEY||ADAMS|987654321|987890|20142015|51|0101000|DEWITTSCHOOLDISTRICT|P|014
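As a variation on the same single-pass idea, here is a minimal sketch that lets a defaultdict hand out the next ID the first time an SSN is looked up. The function name add_ids, its parameters, and the shortened in-memory sample are made up for illustration; only the csv, itertools and collections calls are standard library.

```python
import csv
import io
from collections import defaultdict
from itertools import count


def add_ids(src, dst, key_col=4, start=10001):
    """Copy pipe-delimited rows from src to dst, prefixing one ID per distinct key."""
    reader = csv.reader(src, delimiter="|")
    writer = csv.writer(dst, delimiter="|", lineterminator="\n")
    # A missing key calls count(start).__next__, so each new SSN gets the next ID.
    ids = defaultdict(count(start).__next__)
    writer.writerow(["ID"] + next(reader))  # header row
    for row in reader:
        writer.writerow([ids[row[key_col]]] + row)
    return ids


# Hypothetical, shortened sample with the SSN in column index 4.
sample = io.StringIO(
    "RefID|FirstName|MiddleName|LastName|SSN\n"
    "1|JULIE|A|ADAMS|123456789\n"
    "7|SHIRLEY||ADAMS|987654321\n"
    "2|JULIE|A|ADAMS|123456789\n"
)
out = io.StringIO()
mapping = add_ids(sample, out)
print(out.getvalue())
```

Because the lookup is by SSN rather than by position, the rows do not need to be sorted for this to work.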
If you want to update the original file, you can write to a temporary file and then do a shutil.move:
import csv
from shutil import move
from tempfile import NamedTemporaryFile
import os
cn = 10001
try:
    with open("test.txt") as f, NamedTemporaryFile("w", dir=".", delete=False) as tmp:
        r, wr = csv.reader(f, delimiter="|"), csv.writer(tmp, delimiter="|")
        head, d = next(r), {}
        wr.writerow(["ID"] + head)
        for row in r:
            v = row[4]
            if v in d:
                wr.writerow([d[v]] + row)
            else:
                d[v] = cn
                wr.writerow([cn] + row)
                cn += 1
    # replace original file
    move(tmp.name, "test.txt")
finally:
    # remove the temporary file if something failed before the move
    if os.path.isfile(tmp.name):
        os.unlink(tmp.name)
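The temp-file approach can also be wrapped in a function. The following is only a sketch under a few assumptions: the name add_ids_in_place and its parameters are hypothetical, os.replace (Python 3.3+) is swapped in for shutil.move, and the demo writes a shortened sample into a throwaway directory instead of touching a real test.txt.

```python
import csv
import os
import tempfile


def add_ids_in_place(path, key_col=4, start=10001):
    """Rewrite the pipe-delimited file at path with a leading ID column."""
    ids = {}
    folder = os.path.dirname(os.path.abspath(path))
    tmp = None  # stays None if opening either file fails
    try:
        with open(path, newline="") as src, tempfile.NamedTemporaryFile(
            "w", dir=folder, delete=False, newline=""
        ) as tmp:
            reader = csv.reader(src, delimiter="|")
            writer = csv.writer(tmp, delimiter="|", lineterminator="\n")
            writer.writerow(["ID"] + next(reader))
            for row in reader:
                # A new SSN gets start + number of SSNs seen so far.
                ids.setdefault(row[key_col], start + len(ids))
                writer.writerow([ids[row[key_col]]] + row)
        # Both files are closed here; swap the temp file into place.
        os.replace(tmp.name, path)
    finally:
        # Clean up the temp file if anything failed before the replace.
        if tmp is not None and os.path.exists(tmp.name):
            os.unlink(tmp.name)
    return ids


with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "test.txt")
    with open(demo_path, "w") as f:
        f.write(
            "RefID|FirstName|MiddleName|LastName|SSN\n"
            "1|JULIE|A|ADAMS|123456789\n"
            "7|SHIRLEY||ADAMS|987654321\n"
        )
    add_ids_in_place(demo_path)
    with open(demo_path) as f:
        result = f.read().splitlines()
    leftovers = os.listdir(demo_dir)
print(result)
```

Creating the temp file in the same directory as the target keeps the final replace on one filesystem, which is why the snippet above passes dir as well.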
If your data really is sorted as in your sample input, so that all rows for one SSN are adjacent, you can use groupby:
import csv
from itertools import groupby
from operator import itemgetter
cn = 10001
with open("test.txt") as f, open("out.txt", "w") as tmp:
    r, wr = csv.reader(f, delimiter="|"), csv.writer(tmp, delimiter="|")
    head, d = next(r), {}
    wr.writerow(["ID"] + head)
    # each run of consecutive rows with the same SSN gets one ID
    for k, v in groupby(r, key=itemgetter(4)):
        wr.writerows([cn]+sub for sub in v)
        cn += 1
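The sorted-input condition matters because itertools.groupby only merges consecutive items with equal keys. A small sketch with made-up three-column rows (SSN at index 2) shows what happens when the same SSN appears in two separate runs:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows: RefID, FirstName, SSN
rows = [
    ["1", "JULIE", "123456789"],
    ["7", "SHIRLEY", "987654321"],
    ["2", "JULIE", "123456789"],  # same SSN again, but not adjacent
]
ssn = itemgetter(2)

# Unsorted: the repeated SSN starts a second group and would get a second ID.
unsorted_groups = [key for key, _ in groupby(rows, key=ssn)]
# Sorted by SSN first: one group per distinct SSN.
sorted_groups = [key for key, _ in groupby(sorted(rows, key=ssn), key=ssn)]

print(unsorted_groups)  # ['123456789', '987654321', '123456789']
print(sorted_groups)    # ['123456789', '987654321']
```

If the file is not guaranteed to be grouped by SSN, the dict-based version is the safer choice, since sorting first changes the row order.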
Answer 1 (score: 0)
Well, you already have a unique number: the SSN. What you can do is build a dictionary that maps each SSN to a unique code.
inputReader = open(inputFile, 'r')
outputWriter = open(outputFile, 'w')
headers = inputReader.readline()
outputWriter.write("ID|" + headers)
ssn_dict = {}
counter = 100000
for x in inputReader:
    ssn = x.split("|")[4]
    ssn_count = ssn_dict.get(ssn)
    if ssn_count is None:
        # first time this SSN is seen: assign the next free ID
        ssn_count = ssn_dict[ssn] = counter
        counter += 1
    outputWriter.write(str(ssn_count) + "|" + x)
inputReader.close()
outputWriter.close()
Answer 2 (score: 0)
Have you heard of pandas? It can help you here!
import numpy as np
import pandas as pd
# Load data set
data = pd.read_csv(inputFile, delimiter='|')
# Tag
def func(ssn):
    if ssn == 123456789:
        return 10001
    if ssn == 987654321:
        return 10002
data['ID'] = data['SSN'].apply(func)
# Reorder columns
new_cols = np.concatenate((data.columns[-1:], data.columns[:-1]), axis=0)
data = data[new_cols]
# Save file
data.to_csv(outputFile, sep='|', index=False)
The output is:
ID|RefID|FirstName|MiddleName|LastName|SSN|DOB|School Year|Age|District LEA|District Description|School LEA|Location Description|title|frng_amt
10001|1|JULIE|A|ADAMS|123456789|654321|20142015|47|101000|DEWITTSCHOOLDISTRICT|P|14||
10001|2|JULIE|A|ADAMS|123456789|654321|20132014|46|101000|DEWITTSCHOOLDISTRICT|S|13100||
10001|3|JULIE|A|ADAMS|123456789|654321|20122013|45|101000|DEWITTSCHOOLDISTRICT|P|14||
10001|4|JULIE|A|ADAMS|123456789|654321|20132014|46|101000|DEWITTSCHOOLDISTRICT|P|14||
10001|5|JULIE|A|ADAMS|123456789|654321|20142015|47|101000|DEWITTSCHOOLDISTRICT|S|15000||
10001|6|JULIE|A|ADAMS|123456789|654321|20122013|45|101000|DEWITTSCHOOLDISTRICT|S|13100||
10002|7|SHIRLEY||ADAMS|987654321|987890|20122013|49|101000|DEWITTSCHOOLDISTRICT|S|13100||
10002|8|SHIRLEY||ADAMS|987654321|987890|20092010|46|101000|DEWITTSCHOOLDISTRICT|P|14||
10002|9|SHIRLEY||ADAMS|987654321|987890|20102011|47|101000|DEWITTSCHOOLDISTRICT|P|14||
10002|10|SHIRLEY||ADAMS|987654321|987890|20132014|50|101000|DEWITTSCHOOLDISTRICT|S|13100||
10002|11|SHIRLEY||ADAMS|987654321|987890|20132014|50|101000|DEWITTSCHOOLDISTRICT|P|14||
10002|12|SHIRLEY||ADAMS|987654321|987890|20122013|49|101000|DEWITTSCHOOLDISTRICT|P|14||
10002|13|SHIRLEY||ADAMS|987654321|987890|20102011|47|101000|DEWITTSCHOOLDISTRICT|A|13100||
10002|14|SHIRLEY||ADAMS|987654321|987890|20142015|51|101000|DEWITTSCHOOLDISTRICT|S|15000||
10002|15|SHIRLEY||ADAMS|987654321|987890|20092010|46|101000|DEWITTSCHOOLDISTRICT|A|13100||
10002|16|SHIRLEY||ADAMS|987654321|987890|20142015|51|101000|DEWITTSCHOOLDISTRICT|P|14||
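Note that in the output above the leading zeros are gone (0101000 became 101000 and 014 became 14), because read_csv infers those columns as integers. If the zeros matter, one option is to read every column as text with dtype=str. The sketch below uses a made-up four-column sample rather than the full file:

```python
import io

import pandas as pd

# Hypothetical, shortened sample with zero-padded codes.
raw = (
    "RefID|SSN|District LEA|Location Description\n"
    "1|123456789|0101000|014\n"
    "7|987654321|0101000|13100\n"
)

as_numbers = pd.read_csv(io.StringIO(raw), sep="|")
as_text = pd.read_csv(io.StringIO(raw), sep="|", dtype=str)

print(as_numbers["District LEA"].tolist())  # [101000, 101000]
print(as_text["District LEA"].tolist())     # ['0101000', '0101000']
```

With dtype=str the SSN column is text as well, so any comparison against it has to use strings such as '123456789' rather than integers.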
Update
As Padraic Cunningham pointed out, the OP may have more than two SSNs. In that case, the best solution would be:
import numpy as np
import pandas as pd
# Load data set
data = pd.read_csv(inputFile, delimiter='|')
# Tag: number the distinct SSNs in order of first appearance, starting at 10001
tag = {k: 10001 + i for i, k in enumerate(data['SSN'].unique())}
data['ID'] = data['SSN'].apply(lambda ssn: tag[ssn])
# Reorder columns
new_cols = np.concatenate((data.columns[-1:], data.columns[:-1]), axis=0)
data = data[new_cols]
# Save file
data.to_csv(outputFile, sep='|', index=False)
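The enumerate-over-unique step can also be expressed with pd.factorize, which returns an integer code per row in order of first appearance. This is a sketch on a small made-up frame, not a replacement for the code above:

```python
import pandas as pd

# Hypothetical frame; the larger SSN appears first on purpose.
data = pd.DataFrame({
    "RefID": [1, 7, 2],
    "SSN": [987654321, 123456789, 987654321],
})

# codes is 0 for the first distinct SSN seen, 1 for the second, and so on.
codes, uniques = pd.factorize(data["SSN"])
data["ID"] = codes + 10001

# Move ID to the front without needing numpy.
data = data[["ID"] + [c for c in data.columns if c != "ID"]]
print(data)
```

The IDs follow the order in which SSNs first appear in the file, which matches the dict-based answers.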
Answer 3 (score: 0)
The best tool for working with tables is pandas. Here is what you want to do:
import pandas as pd
df = pd.read_csv('your input file path', sep='|')
df['ID'] = df['SSN'].rank(method='dense').astype(int) + 100000
df.to_csv('your output file path', sep='|', index=False)
Output (look at the last column):
RefID|FirstName|MiddleName|LastName|SSN|DOB|School Year|Age|District LEA|District Description|School LEA|Location Description|title|frng_amt|ID
1|JULIE|A|ADAMS|123456789|654321|20142015|47|101000|DEWITTSCHOOLDISTRICT|P|14|||100001
2|JULIE|A|ADAMS|123456789|654321|20132014|46|101000|DEWITTSCHOOLDISTRICT|S|13100|||100001
3|JULIE|A|ADAMS|123456789|654321|20122013|45|101000|DEWITTSCHOOLDISTRICT|P|14|||100001
4|JULIE|A|ADAMS|123456789|654321|20132014|46|101000|DEWITTSCHOOLDISTRICT|P|14|||100001
5|JULIE|A|ADAMS|123456789|654321|20142015|47|101000|DEWITTSCHOOLDISTRICT|S|15000|||100001
6|JULIE|A|ADAMS|123456789|654321|20122013|45|101000|DEWITTSCHOOLDISTRICT|S|13100|||100001
7|SHIRLEY||ADAMS|987654321|987890|20122013|49|101000|DEWITTSCHOOLDISTRICT|S|13100|||100002
8|SHIRLEY||ADAMS|987654321|987890|20092010|46|101000|DEWITTSCHOOLDISTRICT|P|14|||100002
9|SHIRLEY||ADAMS|987654321|987890|20102011|47|101000|DEWITTSCHOOLDISTRICT|P|14|||100002
10|SHIRLEY||ADAMS|987654321|987890|20132014|50|101000|DEWITTSCHOOLDISTRICT|S|13100|||100002
11|SHIRLEY||ADAMS|987654321|987890|20132014|50|101000|DEWITTSCHOOLDISTRICT|P|14|||100002
12|SHIRLEY||ADAMS|987654321|987890|20122013|49|101000|DEWITTSCHOOLDISTRICT|P|14|||100002
13|SHIRLEY||ADAMS|987654321|987890|20102011|47|101000|DEWITTSCHOOLDISTRICT|A|13100|||100002
14|SHIRLEY||ADAMS|987654321|987890|20142015|51|101000|DEWITTSCHOOLDISTRICT|S|15000|||100002
15|SHIRLEY||ADAMS|987654321|987890|20092010|46|101000|DEWITTSCHOOLDISTRICT|A|13100|||100002
16|SHIRLEY||ADAMS|987654321|987890|20142015|51|101000|DEWITTSCHOOLDISTRICT|P|14|||100002
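Two details of the rank approach are worth seeing on a tiny example: rank(method='dense') numbers the SSNs by their sorted value rather than by order of appearance, and the new column lands last unless it is inserted at the front. The frame below is made up, and using DataFrame.insert here is a suggestion rather than part of the answer:

```python
import pandas as pd

# Hypothetical frame; the larger SSN appears first on purpose.
df = pd.DataFrame({
    "RefID": [1, 2, 3],
    "SSN": [987654321, 123456789, 987654321],
})

# Dense rank: the smallest SSN gets 1, the next distinct value 2, with no gaps.
ids = df["SSN"].rank(method="dense").astype(int) + 100000

# Put the ID in the first column instead of appending it at the end.
df.insert(0, "ID", ids)
print(df.to_csv(sep="|", index=False))
```

Here the first row gets 100002 because its SSN is the larger of the two, so IDs assigned this way can change for existing people if a smaller SSN is added to the file later.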