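I am trying to update a csv file with some student figures provided by another source, but their csv data is formatted slightly differently from ours.
It needs to match students on three criteria: their name, their class, and finally the first few letters of the location. For example, the first few students in Class B are listed as being from Dumpt, which is actually Dumpton Park.
When a match is found, the Score and No columns in our file should be updated from their Scorecard and Number columns, except where their value is 0 or blank.
Below is some sample data: first our csv, then their csv, then our csv as it should look after the update.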
Class,Local,Name,DPE,JJK,Score,No
Class A,York,Tom,x,x,32,
Class A,York,Jim,x,x,10,
Class A,York,Sam,x,x,32,
Class B,Dumpton Park,Sarah,x,x,,
Class B,Dumpton Park,Bob,x,x,,
Class B,Dumpton Park,Bill,x,x,,
Class A,Dover,Andy,x,x,,
Class A,Dover,Hannah,x,x,,
Class B,London,Jemma,x,x,,
Class B,London,James,x,x,,
"Class","Location","Student","Scorecard","Number"
"Class A","York","Jim","0","742"
"Class A","York","Sam","0","931"
"Class A","York","Tom","0","653"
"Class B","Dumpt","Bob","23.1","299"
"Class B","Dumpt","Bill","23.4","198"
"Class B","Dumpt","Sarah","23.5","12"
"Class A","Dover","Andy","23","983"
"Class A","Dover","Hannah","1","293"
"Class B","Lond","Jemma","32.2","0"
"Class B","Lond","James","32.0","0"
Class,Local,Name,DPE,JJK,Score,No
Class A,York,Tom,x,x,32,653
Class A,York,Jim,x,x,10,742
Class A,York,Sam,x,x,32,653
Class B,Dumpton Park,Sarah,x,x,23.5,12
Class B,Dumpton Park,Bob,x,x,23.1,299
Class B,Dumpton Park,Bill,x,x,23.4,198
Class A,Dover,Andy,x,x,23,983
Class A,Dover,Hannah,x,x,1,293
Class B,London,Jemma,x,x,32.2,
Class B,London,James,x,x,32.0,
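I would really appreciate any help with this problem. Thanks, Oliver
Answer 0 (score: 9)
Here are two solutions: a pandas solution and a plain Python solution. First the pandas solution, which unsurprisingly looks much like the other pandas solutions...
First load the data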
import pandas
import numpy as np
cdf1 = pandas.read_csv('csv1',dtype=object) #dtype = object allows us to preserve the numeric formats
cdf2 = pandas.read_csv('csv2',dtype=object)
col_order = cdf1.columns #pandas will shuffle the column order at some point; this allows us to reset to the original column order
At this point the data frames look like
In [6]: cdf1
Out[6]:
Class Local Name DPE JJK Score No
0 Class A York Tom x x 32 NaN
1 Class A York Jim x x 10 NaN
2 Class A York Sam x x 32 NaN
3 Class B Dumpton Park Sarah x x NaN NaN
4 Class B Dumpton Park Bob x x NaN NaN
5 Class B Dumpton Park Bill x x NaN NaN
6 Class A Dover Andy x x NaN NaN
7 Class A Dover Hannah x x NaN NaN
8 Class B London Jemma x x NaN NaN
9 Class B London James x x NaN NaN
In [7]: cdf2
Out[7]:
Class Location Student Scorecard Number
0 Class A York Jim 0 742
1 Class A York Sam 0 931
2 Class A York Tom 0 653
3 Class B Dumpt Bob 23.1 299
4 Class B Dumpt Bill 23.4 198
5 Class B Dumpt Sarah 23.5 12
6 Class A Dover Andy 23 983
7 Class A Dover Hannah 1 293
8 Class B Lond Jemma 32.2 0
9 Class B Lond James 32.0 0
Next, manipulate the data frames into matching formats.
dcol = cdf2.Location
cdf2['Location'] = dcol.apply(lambda x: x[0:4]) #Replacement in cdf2 since we don't need original data
dcol = cdf1.Local
cdf1['Location'] = dcol.apply(lambda x: x[0:4]) #Here we add a column leaving 'Local' because we'll need it for the final output
cdf2 = cdf2.rename(columns={'Student': 'Name', 'Scorecard': 'Score', 'Number': 'No'})
cdf2 = cdf2.replace('0', np.nan) #Replacing '0' by np.nan means zeros don't overwrite
cdf1 = cdf1.set_index(['Class', 'Location', 'Name'])
cdf2 = cdf2.set_index(['Class', 'Location', 'Name'])
Now cdf1 and cdf2 look like
In [16]: cdf1
Out[16]:
                                Local DPE JJK Score   No
Class   Location Name
Class A York     Tom             York   x   x    32  NaN
                 Jim             York   x   x    10  NaN
                 Sam             York   x   x    32  NaN
Class B Dump     Sarah   Dumpton Park   x   x   NaN  NaN
                 Bob     Dumpton Park   x   x   NaN  NaN
                 Bill    Dumpton Park   x   x   NaN  NaN
Class A Dove     Andy           Dover   x   x   NaN  NaN
                 Hannah         Dover   x   x   NaN  NaN
Class B Lond     Jemma         London   x   x   NaN  NaN
                 James         London   x   x   NaN  NaN
In [17]: cdf2
Out[17]:
                         Score   No
Class   Location Name
Class A York     Jim       NaN  742
                 Sam       NaN  931
                 Tom       NaN  653
Class B Dump     Bob      23.1  299
                 Bill     23.4  198
                 Sarah    23.5   12
Class A Dove     Andy       23  983
                 Hannah      1  293
Class B Lond     Jemma    32.2  NaN
                 James    32.0  NaN
Update the data in cdf1 with the data from cdf2
cdf1.update(cdf2, overwrite=False)
The result
In [19]: cdf1
Out[19]:
                                Local DPE JJK Score   No
Class   Location Name
Class A York     Tom             York   x   x    32  653
                 Jim             York   x   x    10  742
                 Sam             York   x   x    32  931
Class B Dump     Sarah   Dumpton Park   x   x  23.5   12
                 Bob     Dumpton Park   x   x  23.1  299
                 Bill    Dumpton Park   x   x  23.4  198
Class A Dove     Andy           Dover   x   x    23  983
                 Hannah         Dover   x   x     1  293
Class B Lond     Jemma         London   x   x  32.2  NaN
                 James         London   x   x  32.0  NaN
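For readers unfamiliar with update(..., overwrite=False), the behaviour relied on here is roughly "fill in only the blanks". The following is a plain-Python analogy, not pandas's implementation: it assumes rows are kept in dicts keyed by (Class, Location, Name), uses None as a stand-in for NaN, and fill_missing is a hypothetical helper name.

```python
def fill_missing(ours, theirs):
    """Copy values from `theirs` into `ours` only where ours is missing.

    Both arguments map a key to a dict of column values; None marks a
    missing value (the stand-in for NaN in this sketch).
    """
    for key, their_row in theirs.items():
        our_row = ours.get(key)
        if our_row is None:
            continue  # keys that are absent from `ours` are not added
        for column, value in their_row.items():
            # Only fill cells that exist in our row, are empty, and have a value to copy.
            if column in our_row and our_row[column] is None and value is not None:
                our_row[column] = value
    return ours


ours = {
    ("Class A", "York", "Tom"): {"Score": "32", "No": None},
    ("Class B", "Dump", "Bob"): {"Score": None, "No": None},
}
theirs = {
    ("Class A", "York", "Tom"): {"Score": None, "No": "653"},
    ("Class B", "Dump", "Bob"): {"Score": "23.1", "No": "299"},
}
fill_missing(ours, theirs)
```

Tom keeps his existing Score of 32 and only gains a No, which mirrors why the zeros in cdf2 were turned into NaN before the update.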
Finally, return cdf1 to its original format and write it to a csv file.
cdf1 = cdf1.reset_index() #These two steps allow us to remove the 'Location' column
del cdf1['Location']
cdf1 = cdf1[col_order] #This will switch Local and Name back to their original order
cdf1.to_csv('temp.csv',index = False)
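Two notes. First, given how easy it is to use cdf1.Local.value_counts() or len(cdf1.Local.value_counts()) and the like, I strongly recommend adding some summary checks to make sure you don't accidentally lose locations when going from the full location to its first few letters. Second, I sincerely hope there is a typo in row 4 of your desired output.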
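One way to sketch such a check without pandas is to group the full location names by their truncated prefix and report any prefix that maps to more than one name. find_prefix_collisions is a hypothetical helper name, and the extra town in the second call is made up purely to show a collision.

```python
from collections import defaultdict


def find_prefix_collisions(locations, length=4):
    """Return {prefix: sorted full names} for prefixes shared by 2+ locations."""
    by_prefix = defaultdict(set)
    for name in locations:
        by_prefix[name[:length]].add(name)
    return {p: sorted(names) for p, names in by_prefix.items() if len(names) > 1}


towns = ["York", "Dumpton Park", "Dover", "London"]
print(find_prefix_collisions(towns))                    # {}
print(find_prefix_collisions(towns + ["Londonderry"]))  # {'Lond': ['London', 'Londonderry']}
```

An empty result for the real data means the four-letter truncation is safe to use as part of the key.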
On to a plain Python solution. In the following, adjust the file names as needed.
#Open all of the necessary files
csv1 = open('csv1','r')
csv2 = open('csv2','r')
csvout = open('csv_out','w')
#Read past both headers and write the header to the outfile
wstr = csv1.readline()
csvout.write(wstr)
csv2.readline()
#Read csv1 into a dictionary keyed by Class, the first four characters of Local, and Name,
#and keep a list of keys for line ordering
line_keys = []
line_dict = {}
for line in csv1:
    s = line.rstrip('\n').split(',') #Strip the newline so the last field stays clean
    this_key = (s[0],s[1][0:4],s[2])
    line_dict[this_key] = s
    line_keys.append(this_key)
#Go through csv2 updating the data in csv1 as necessary
for line in csv2:
    s = line.rstrip('\n').replace('"','').split(',')
    this_key = (s[0],s[1][0:4],s[2])
    if this_key in line_dict: #Lowers the crash rate...
        #Check if need to replace Score...
        if len(s[3]) > 0 and float(s[3]) != 0:
            line_dict[this_key][5] = s[3]
        #Check if need to replace No...
        if len(s[4]) > 0 and float(s[4]) != 0:
            line_dict[this_key][6] = s[4]
    else:
        print("Line not in csv1: %s" % line)
#Write the updated line_dict to csvout
for key in line_keys:
    wstr = ','.join(line_dict[key])
    csvout.write(wstr)
    csvout.write('\n')
#Close all of the open filehandles
csv1.close()
csv2.close()
csvout.close()
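The float() calls above will raise a ValueError if the other source ever sends something that is not a number. A slightly more defensive version of the "blank or zero" test could look like the sketch below; is_usable is a hypothetical helper name and treating non-numeric text as "do not overwrite" is an assumption, not something stated in the question.

```python
import math


def is_usable(text):
    """Return True when `text` is a non-blank, non-zero number."""
    text = text.strip()
    if not text:
        return False  # blank: keep our existing value
    try:
        value = float(text)
    except ValueError:
        return False  # not a number at all, so leave our value alone
    return value != 0 and not math.isnan(value)
```

With this helper the two checks in the loop could be written as `if is_usable(s[3]):` and `if is_usable(s[4]):`.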
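Answer 1 (score: 5)
You can use fuzzywuzzy to match the town names, and append the result as a column to df2:
import pandas as pd
from fuzzywuzzy.fuzz import partial_ratio
df1 = pd.read_csv(csv1) # csv1 and csv2 are the paths to the two files
df2 = pd.read_csv(csv2)
towns = df1.Local.unique() # assuming this is the complete list of towns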
In [11]: df2['Local'] = df2.Location.apply(lambda short_location: max(towns, key=lambda t: partial_ratio(short_location, t)))
In [12]: df2
Out[12]:
Class Location Student Scorecard Number Local
0 Class A York Jim 0.0 742 York
1 Class A York Sam 0.0 931 York
2 Class A York Tom 0.0 653 York
3 Class B Dumpt Bob 23.1 299 Dumpton Park
4 Class B Dumpt Bill 23.4 198 Dumpton Park
5 Class B Dumpt Sarah 23.5 12 Dumpton Park
6 Class A Dover Andy 23.0 983 Dover
7 Class A Dover Hannah 1.0 293 Dover
8 Class B Lond Jemma 32.2 0 London
9 Class B Lond James 32.0 0 London
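If installing fuzzywuzzy is not an option, a similar lookup can be sketched with the standard library alone. Note that this swaps in a different technique from partial_ratio: a plain prefix test, with difflib.get_close_matches as a fallback. expand_location is a hypothetical helper name, and returning None for ambiguous input is a design assumption.

```python
import difflib


def expand_location(short, towns):
    """Map an abbreviated location to a full town name, or None if unsure."""
    prefix_hits = [t for t in towns if t.lower().startswith(short.lower())]
    if len(prefix_hits) == 1:
        return prefix_hits[0]
    if len(prefix_hits) > 1:
        return None  # ambiguous prefix: better to flag it than to guess
    # No prefix match: fall back to the closest name, if any is close enough.
    close = difflib.get_close_matches(short, towns, n=1, cutoff=0.6)
    return close[0] if close else None


towns = ["York", "Dumpton Park", "Dover", "London"]
print(expand_location("Dumpt", towns))  # Dumpton Park
print(expand_location("Lond", towns))   # London
```

Returning None instead of the best guess makes unmatched or ambiguous abbreviations visible, so they can be reviewed by hand before the merge.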
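Make the names consistent (at the moment the same column is called Student in one frame and Name in the other):
In [13]: df2.rename(columns={'Student': 'Name'}, inplace=True)
Now you can merge (on the overlapping columns):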
In [14]: res = df1.merge(df2, how='outer')
In [15]: res
Out[15]:
Class Local Name DPE JJK Score No Location Scorecard Number
0 Class A York Tom x x 32 NaN York 0.0 653
1 Class A York Jim x x 10 NaN York 0.0 742
2 Class A York Sam x x 32 NaN York 0.0 931
3 Class B Dumpton Park Sarah x x NaN NaN Dumpt 23.5 12
4 Class B Dumpton Park Bob x x NaN NaN Dumpt 23.1 299
5 Class B Dumpton Park Bill x x NaN NaN Dumpt 23.4 198
6 Class A Dover Andy x x NaN NaN Dover 23.0 983
7 Class A Dover Hannah x x NaN NaN Dover 1.0 293
8 Class B London Jemma x x NaN NaN Lond 32.2 0
9 Class B London James x x NaN NaN Lond 32.0 0
What is left to clean up is the score; I think I would take the maximum of the two:
In [16]: res['Score'] = res.loc[:, ['Score', 'Scorecard']].max(axis=1)
In [17]: del res['Scorecard']
del res['No']
del res['Location']
Then you are left with the columns you wanted:
In [18]: res
Out[18]:
Class Local Name DPE JJK Score Number
0 Class A York Tom x x 32.0 653
1 Class A York Jim x x 10.0 742
2 Class A York Sam x x 32.0 931
3 Class B Dumpton Park Sarah x x 23.5 12
4 Class B Dumpton Park Bob x x 23.1 299
5 Class B Dumpton Park Bill x x 23.4 198
6 Class A Dover Andy x x 23.0 983
7 Class A Dover Hannah x x 1.0 293
8 Class B London Jemma x x 32.2 0
9 Class B London James x x 32.0 0
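In [19]: res.to_csv('foo.csv', index=False)
Note: to force the dtype to object (so that the column holds a mix of ints and floats rather than all floats), you can use apply. If you are doing any analysis, I would advise against this!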
res['Score'] = res['Score'].apply(lambda x: int(x) if int(x) == x else x, convert_dtype=False)
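If the goal is only how the numbers look in the written file, an alternative is to leave the column numeric and decide on the text form at output time. A minimal sketch, where format_number is a hypothetical helper name:

```python
def format_number(value):
    """Render 32.0 as '32' but keep 23.5 as '23.5'."""
    if float(value).is_integer():
        return str(int(value))
    return str(value)


print(format_number(32.0))  # 32
print(format_number(23.5))  # 23.5
```

This keeps the data usable for analysis while still producing whole numbers without a trailing ".0" in the output.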
Answer 2 (score: 5)
Hopefully this code is a bit more readable. ;) A backport of Python's new Enum type is available on PyPI (enum34).
from enum import Enum # see PyPI for the backport (enum34)

class Field(Enum):
    course = 0
    location = 1
    student = 2
    dpe = 3
    jjk = 4
    score = -2
    number = -1
    def __index__(self):
        return self._value_

def Float(text):
    if not text:
        return 0.0
    return float(text)

def load_our_data(filename):
    "return a dict using the first three fields as the key"
    data = dict()
    with open(filename) as input:
        next(input) # throw away header
        for line in input:
            fields = line.strip('\n').split(',')
            fields[Field.score] = Float(fields[Field.score])
            fields[Field.number] = Float(fields[Field.number])
            key = (
                fields[Field.course].lower(),
                fields[Field.location][:4].lower(),
                fields[Field.student].lower(),
                )
            data[key] = fields
    return data

def load_their_data(filename):
    "return a dict using the first three fields as the key"
    data = dict()
    with open(filename) as input:
        next(input) # throw away header
        for line in input:
            fields = line.strip('\n').split(',')
            fields = [f.strip('"') for f in fields]
            fields[Field.score] = Float(fields[Field.score])
            fields[Field.number] = Float(fields[Field.number])
            key = (
                fields[Field.course].lower(),
                fields[Field.location][:4].lower(),
                fields[Field.student].lower(),
                )
            data[key] = fields
    return data

def merge_data(ours, theirs):
    "their data is only used if not blank and non-zero"
    for key, our_data in ours.items():
        their_data = theirs[key]
        if their_data[Field.score]:
            our_data[Field.score] = their_data[Field.score]
        if their_data[Field.number]:
            our_data[Field.number] = their_data[Field.number]

def write_our_data(data, filename):
    with open(filename, 'w') as output:
        for record in sorted(data.values()):
            line = ','.join([str(f) for f in record])
            output.write(line + '\n')

if __name__ == '__main__':
    ours = load_our_data('one.csv')
    theirs = load_their_data('two.csv')
    merge_data(ours, theirs)
    write_our_data(ours, 'three.csv')
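The trick that lets Field members be used directly as list indexes is the __index__ method, and the negative values let the same names address the last two columns in both file layouts. A small self-contained illustration, using a cut-down hypothetical Column enum rather than the Field class above:

```python
from enum import Enum


class Column(Enum):
    student = 2
    score = -2   # second-to-last column in both layouts
    number = -1  # last column in both layouts

    def __index__(self):
        # Lets a member be used wherever Python expects an integer index.
        return self._value_


our_row = ["Class A", "York", "Tom", "x", "x", "32", ""]
their_row = ["Class A", "York", "Tom", "0", "653"]

our_row[Column.number] = their_row[Column.number]
print(our_row[Column.student], our_row[Column.score], our_row[Column.number])  # Tom 32 653
```

enum.IntEnum would give the same indexing behaviour without a custom __index__, at the cost of members also comparing equal to plain integers.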
Answer 3 (score: 4)
Python dictionaries are the way to go here:
csv1, csv2, outFile = 'csv1.csv', 'csv2.csv', 'out.csv' # placeholder file names, adjust as needed
studentDict = {}
with open(csv1, 'r') as f:
    header = next(f) # keep our header for the output file
    for line in f:
        LL = line.rstrip('\n').replace('"','').split(',')
        # key on class, first four letters of the location, and name
        studentDict[LL[0], LL[1][:4], LL[2]] = LL
with open(csv2, 'r') as f:
    next(f) # skip their header
    for line in f:
        LL = line.rstrip('\n').replace('"','').split(',')
        key = (LL[0], LL[1][:4], LL[2])
        if key not in studentDict: continue # student not in our file
        if LL[-2] not in ('0', ''): studentDict[key][-2] = LL[-2]
        if LL[-1] not in ('0', ''): studentDict[key][-1] = LL[-1]
with open(outFile, 'w') as f:
    f.write(header)
    for v in studentDict.values():
        f.write(','.join(v) + '\n')
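Answer 4 (score: 4)
pandas makes this kind of task more convenient.
EDIT: OK, since you can't rely on renaming the columns by hand, Roman's suggestion to just match on the first few letters is a good one. We have to change a few things before that, though.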
In [62]: df1 = pd.read_clipboard(sep=',')
In [63]: df2 = pd.read_clipboard(sep=',')
In [68]: df1
Out[68]:
Class Location Student Scorecard Number
0 Class A York Jim 0.0 742
1 Class A York Sam 0.0 931
2 Class A York Tom 0.0 653
3 Class B Dumpt Bob 23.1 299
4 Class B Dumpt Bill 23.4 198
5 Class B Dumpt Sarah 23.5 12
6 Class A Dover Andy 23.0 983
7 Class A Dover Hannah 1.0 293
8 Class B Lond Jemma 32.2 0
9 Class B Lond James 32.0 0
In [69]: df2
Out[69]:
Class Local Name DPE JJK Score No
0 Class A York Tom x x 32.0 653
1 Class A York Jim x x 10.0 742
2 Class A York Sam x x 32.0 653
3 Class B Dumpton Park Sarah x x 23.5 12
4 Class B Dumpton Park Bob x x 23.1 299
5 Class B Dumpton Park Bill x x 23.4 198
6 Class A Dover Andy x x 23.0 983
7 Class A Dover Hannah x x 1.0 293
8 Class B London Jemma x x 32.2 NaN
9 Class B London James x x 32.0 NaN
Get the columns named the same.
In [70]: df1 = df1.rename(columns={'Location': 'Local', 'Student': 'Name', 'Scorecard': 'Score', 'Number': 'No'})
Now for the locations. Save the originals from df2 to a separate Series.
In [71]: locations = df2['Local']
In [72]: df1['Local'] = df1['Local'].str.slice(0, 4)
In [73]: df2['Local'] = df2['Local'].str.slice(0, 4)
This uses the string methods to truncate to the first 4 characters (assuming that won't cause any false matches).
Now set the indices:
In [78]: df1 = df1.set_index(['Class', 'Local', 'Name'])
In [79]: df2 = df2.set_index(['Class', 'Local', 'Name'])
In [80]: df1
Out[80]:
                         Score   No
Class   Local    Name
Class A York     Jim       0.0  742
                 Sam       0.0  931
                 Tom       0.0  653
Class B Dump     Bob      23.1  299
                 Bill     23.4  198
                 Sarah    23.5   12
Class A Dove     Andy     23.0  983
                 Hannah    1.0  293
Class B Lond     Jemma    32.2    0
                 James    32.0    0
In [83]: df1 = df1.replace(0, np.nan)
In [84]: df2 = df2.replace(0, np.nan)
Finally, update the scores as before:
In [85]: df1.update(df2, overwrite=False)
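You can get the original locations back as follows (this relies on both frames listing the locations in the same row order):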
In [91]: df1 = df1.reset_index()
In [92]: df1['Local'] = locations
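You can write the output with df1.to_csv('path/to/csv').
Answer 5 (score: 2)
You could try using the csv module from the standard library. My solution is very similar to Chris H's, but I use the csv module to read and write the files. (In fact, I stole his technique of storing the keys in a list to preserve the order.)
If you use the csv module, you don't have to worry much about quotes, and it also lets you read rows directly into dictionaries with the column names as keys.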
import csv

# Open first CSV, and read each line as a dictionary with column names as keys.
with open('csv1.csv', newline='') as csvfile1:
    table1 = csv.DictReader(csvfile1, ['Class', 'Local', 'Name',
                                       'DPE', 'JJK', 'Score', 'No'])
    next(table1) # skip header row
    first_table = {}
    original_order = [] # list keys to save original order
    # build dictionary of rows with name, location, and class as keys
    for row in table1:
        id = "%s from %s in %s" % (row['Name'], row['Local'][:4], row['Class'])
        first_table[id] = row
        original_order.append(id)

# Repeat for second csv, but don't worry about order
with open('csv2.csv', newline='') as csvfile2:
    table2 = csv.DictReader(csvfile2, ['Class', 'Location',
                                       'Student', 'Scorecard', 'Number'])
    next(table2) # skip header row
    second_table = {}
    for row in table2:
        id = "%s from %s in %s" % (row['Student'], row['Location'][:4], row['Class'])
        second_table[id] = row

with open('student_data.csv', 'w', newline='') as finalfile:
    results = csv.DictWriter(finalfile, ['Class', 'Local', 'Name',
                                         'DPE', 'JJK', 'Score', 'No'])
    results.writeheader()
    # Replace data in first csv with data in second csv when conditions are satisfied.
    for student in original_order:
        if second_table[student]['Scorecard'] != "0" and second_table[student]['Scorecard'] != "":
            first_table[student]['Score'] = second_table[student]['Scorecard']
        if second_table[student]['Number'] != "0" and second_table[student]['Number'] != "":
            first_table[student]['No'] = second_table[student]['Number']
        results.writerow(first_table[student])
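To see why the quote handling matters, here is a small sketch comparing csv.DictReader with a naive split. The second data row, with a comma inside the quoted name, is made up for illustration and does not appear in the question's data; unlike the code above, this sketch lets DictReader take the column names from the header row.

```python
import csv
import io

# In-memory stand-in for their file; the "O'Neil, Pat" row is invented.
their_csv = io.StringIO(
    '"Class","Location","Student","Scorecard","Number"\n'
    '"Class B","Dumpt","Bob","23.1","299"\n'
    '"Class B","Dumpt","O\'Neil, Pat","0","12"\n'
)
rows = list(csv.DictReader(their_csv))

# The csv module removes the quotes and keeps the embedded comma intact.
print(rows[1]["Student"])  # O'Neil, Pat

# A naive split breaks the same line into six fields instead of five.
naive = '"Class B","Dumpt","O\'Neil, Pat","0","12"'.split(',')
print(len(naive))  # 6
```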
Hope this helps.