Update a CSV with data from a CSV that has a different format

Time: 2013-11-10 16:28:50

Tags: python python-2.7 csv pandas

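I am trying to update a csv file with some student data provided by another source, but their csv data is formatted slightly differently from ours.

Students need to be matched on three criteria: their name, their class, and finally the first few letters of the location. For example, the first few students in Class B are listed as being from Dumpt, which is actually Dumpton Park.

When a match is found:

  • If a student's Scorecard in CSV 2 is 0 or blank, the Score column in CSV 1 should not be updated
  • If a student's Number in CSV 2 is 0 or blank, the No column in CSV 1 should not be updated
  • Otherwise the values from CSV 2 should be imported into CSV 1

Here is some sample data: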
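A minimal sketch of the matching and update rules described above, in plain Python. The helper names (`match_key`, `is_blank_or_zero`, `merged_value`) and the prefix length of 4 are assumptions made for illustration; the question only says "the first few letters".

```python
def match_key(class_name, location, name, prefix_len=4):
    """Build the key used to match a student across both files.

    Only the first prefix_len characters of the location are compared,
    so 'Dumpt' and 'Dumpton Park' produce the same key. The length 4 is
    an assumption, not something the question specifies.
    """
    return (class_name.strip(), location.strip()[:prefix_len], name.strip())


def is_blank_or_zero(value):
    """True when a CSV 2 value is empty or numerically zero ('0', '0.0')."""
    text = value.strip()
    if not text:
        return True
    try:
        return float(text) == 0
    except ValueError:
        return False  # non-numeric text is treated as a real value


def merged_value(ours, theirs):
    """Keep our value unless theirs is non-blank and non-zero."""
    return ours if is_blank_or_zero(theirs) else theirs


same = match_key("Class B", "Dumpton Park", "Bob") == match_key("Class B", "Dumpt", "Bob")
```

With these helpers, each matched row reduces to `row["Score"] = merged_value(row["Score"], their_scorecard)` and likewise for `No`.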

CSV 1

Class,Local,Name,DPE,JJK,Score,No
Class A,York,Tom,x,x,32,
Class A,York,Jim,x,x,10,
Class A,York,Sam,x,x,32,
Class B,Dumpton Park,Sarah,x,x,,
Class B,Dumpton Park,Bob,x,x,,
Class B,Dumpton Park,Bill,x,x,,
Class A,Dover,Andy,x,x,,
Class A,Dover,Hannah,x,x,,
Class B,London,Jemma,x,x,,
Class B,London,James,x,x,,

CSV 2

"Class","Location","Student","Scorecard","Number"
"Class A","York","Jim","0","742"
"Class A","York","Sam","0","931"
"Class A","York","Tom","0","653"
"Class B","Dumpt","Bob","23.1","299"
"Class B","Dumpt","Bill","23.4","198"
"Class B","Dumpt","Sarah","23.5","12"
"Class A","Dover","Andy","23","983"
"Class A","Dover","Hannah","1","293"
"Class B","Lond","Jemma","32.2","0"
"Class B","Lond","James","32.0","0"

CSV 1 UPDATED (this is the desired output)

Class,Local,Name,DPE,JJK,Score,No
Class A,York,Tom,x,x,32,653
Class A,York,Jim,x,x,10,742
Class A,York,Sam,x,x,32,653
Class B,Dumpton Park,Sarah,x,x,23.5,12
Class B,Dumpton Park,Bob,x,x,23.1,299
Class B,Dumpton Park,Bill,x,x,23.4,198
Class A,Dover,Andy,x,x,23,983
Class A,Dover,Hannah,x,x,1,293
Class B,London,Jemma,x,x,32.2,
Class B,London,James,x,x,32.0,

I would really appreciate any help with this problem. Thanks, Oliver

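6 answers:

Answer 0: (score: 9)

Here are two solutions: a pandas solution and a plain Python solution. First the pandas solution, which unsurprisingly looks a lot like the other pandas solutions...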

First, load the data:

import pandas
import numpy as np

cdf1 = pandas.read_csv('csv1',dtype=object)  #dtype = object allows us to preserve the numeric formats
cdf2 = pandas.read_csv('csv2',dtype=object)

col_order = cdf1.columns  #pandas will shuffle the column order at some point---this allows us to reset to the original column order

At this point the data frames look like:

In [6]: cdf1
Out[6]: 
     Class         Local    Name DPE JJK Score   No
0  Class A          York     Tom   x   x    32  NaN
1  Class A          York     Jim   x   x    10  NaN
2  Class A          York     Sam   x   x    32  NaN
3  Class B  Dumpton Park   Sarah   x   x   NaN  NaN
4  Class B  Dumpton Park     Bob   x   x   NaN  NaN
5  Class B  Dumpton Park    Bill   x   x   NaN  NaN
6  Class A         Dover    Andy   x   x   NaN  NaN
7  Class A         Dover  Hannah   x   x   NaN  NaN
8  Class B        London   Jemma   x   x   NaN  NaN
9  Class B        London   James   x   x   NaN  NaN

In [7]: cdf2
Out[7]: 
     Class Location Student Scorecard Number
0  Class A     York     Jim         0    742
1  Class A     York     Sam         0    931
2  Class A     York     Tom         0    653
3  Class B    Dumpt     Bob      23.1    299
4  Class B    Dumpt    Bill      23.4    198
5  Class B    Dumpt   Sarah      23.5     12
6  Class A    Dover    Andy        23    983
7  Class A    Dover  Hannah         1    293
8  Class B     Lond   Jemma      32.2      0
9  Class B     Lond   James      32.0      0

Next, manipulate the data frames into matching formats.

dcol = cdf2.Location 
cdf2['Location'] = dcol.apply(lambda x: x[0:4])  #Replacement in cdf2 since we don't need original data

dcol = cdf1.Local
cdf1['Location'] = dcol.apply(lambda x: x[0:4])  #Here we add a column leaving 'Local' because we'll need it for the final output

cdf2 = cdf2.rename(columns={'Student': 'Name', 'Scorecard': 'Score', 'Number': 'No'})
cdf2 = cdf2.replace('0', np.nan)  #Replacing '0' by np.nan means zeros don't overwrite

cdf1 = cdf1.set_index(['Class', 'Location', 'Name'])
cdf2 = cdf2.set_index(['Class', 'Location', 'Name'])

Now cdf1 and cdf2 look like:

In [16]: cdf1
Out[16]: 
                                Local DPE JJK Score   No
Class   Location Name                                   
Class A York     Tom             York   x   x    32  NaN
                 Jim             York   x   x    10  NaN
                 Sam             York   x   x    32  NaN
Class B Dump     Sarah   Dumpton Park   x   x   NaN  NaN
                 Bob     Dumpton Park   x   x   NaN  NaN
                 Bill    Dumpton Park   x   x   NaN  NaN
Class A Dove     Andy           Dover   x   x   NaN  NaN
                 Hannah         Dover   x   x   NaN  NaN
Class B Lond     Jemma         London   x   x   NaN  NaN
                 James         London   x   x   NaN  NaN

In [17]: cdf2
Out[17]: 
                        Score   No
Class   Location Name             
Class A York     Jim      NaN  742
                 Sam      NaN  931
                 Tom      NaN  653
Class B Dump     Bob     23.1  299
                 Bill    23.4  198
                 Sarah   23.5   12
Class A Dove     Andy      23  983
                 Hannah     1  293
Class B Lond     Jemma   32.2  NaN
                 James   32.0  NaN

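Update the data in cdf1 with the data in cdf2. With overwrite=False, update only fills cells that are NaN in cdf1, which is sufficient for the sample data; a score that already exists in cdf1 would not be replaced by a nonzero value from cdf2 (the default overwrite=True would replace it, and the NaNs in cdf2 would still leave cdf1 untouched).

cdf1.update(cdf2, overwrite=False)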

Result:

In [19]: cdf1
Out[19]: 
                                Local DPE JJK Score   No
Class   Location Name                                   
Class A York     Tom             York   x   x    32  653
                 Jim             York   x   x    10  742
                 Sam             York   x   x    32  931
Class B Dump     Sarah   Dumpton Park   x   x  23.5   12
                 Bob     Dumpton Park   x   x  23.1  299
                 Bill    Dumpton Park   x   x  23.4  198
Class A Dove     Andy           Dover   x   x    23  983
                 Hannah         Dover   x   x     1  293
Class B Lond     Jemma         London   x   x  32.2  NaN
                 James         London   x   x  32.0  NaN

Finally, return cdf1 to its original format and write it to a csv file.

cdf1 = cdf1.reset_index()  #These two steps allow us to remove the 'Location' column
del cdf1['Location']    
cdf1 = cdf1[col_order]     #This will switch Local and Name back to their original order

cdf1.to_csv('temp.csv',index = False)

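Two notes. First, given how easy it is to use cdf1.Local.value_counts() or len(cdf1.Local.value_counts()) and so on, I strongly recommend adding some summary checks to make sure you don't accidentally merge distinct locations when you go from the full location to its first few letters. Second, I sincerely hope that line 4 of your desired output contains a typo (Sam's No is shown as 653, but CSV 2 gives 931).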
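The sanity check recommended above can also be sketched without pandas. The function name `prefix_collisions` and the extra town "Dovercourt" are hypothetical, added only to show what a collision looks like; the prefix length of 4 follows the slicing used in this answer.

```python
from collections import defaultdict


def prefix_collisions(locations, prefix_len=4):
    """Map each prefix shared by more than one distinct location to those locations."""
    groups = defaultdict(set)
    for location in locations:
        groups[location[:prefix_len]].add(location)
    # Keep only the prefixes that would merge two or more different locations.
    return {prefix: sorted(full) for prefix, full in groups.items() if len(full) > 1}


sample_locations = ["York", "Dumpton Park", "Dover", "London"]
no_problem = prefix_collisions(sample_locations)
# Hypothetical extra town that shares its first four letters with Dover.
problem = prefix_collisions(sample_locations + ["Dovercourt"])
```

An empty result means truncation is safe for the current data; any entry lists the locations that would be treated as one.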

Now on to a plain Python solution. In the code below, adjust the file names as needed.

#Open all of the necessary files
csv1 = open('csv1','r')
csv2 = open('csv2','r')
csvout = open('csv_out','w')

#Read past both headers and write the header to the outfile
wstr = csv1.readline()
csvout.write(wstr)
csv2.readline()

#Read csv1 into a dictionary with keys of Class, first four characters of Local, and Name, and keep a list of keys for line ordering
line_keys = []
line_dict = {}
for line in csv1:
    s = line.split(',')
    this_key = (s[0],s[1][0:4],s[2])
    line_dict[this_key] = s
    line_keys.append(this_key)

#Go through csv2 updating the data in csv1 as necessary
for line in csv2:
    s = line.replace('\"','').split(',')
    this_key = (s[0],s[1][0:4],s[2])
    if this_key in line_dict:   #Lowers the crash rate...
        #Check if need to replace Score...
        if len(s[3]) > 0 and float(s[3]) != 0:
            line_dict[this_key][5] = s[3]
        #Check if need to replace No...
        if len(s[4]) > 0 and float(s[4]) != 0:
            line_dict[this_key][6] = s[4]
    else:
        print "Line not in csv1: %s"%line

#Write the updated line_dict to csvout
for key in line_keys:
    wstr = ','.join(line_dict[key])
    csvout.write(wstr)
csvout.write('\n')

#Close all of the open filehandles
csv1.close()
csv2.close()
csvout.close()
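For readers on Python 3, the same dictionary-based approach can be sketched with the csv module. This is not the answer's code: the function names `usable` and `update_rows` are assumptions, the data is an abbreviated copy of the question's sample held in strings rather than files, and the prefix length of 4 follows the slicing above.

```python
import csv
import io

CSV1 = '''Class,Local,Name,DPE,JJK,Score,No
Class A,York,Tom,x,x,32,
Class B,Dumpton Park,Bob,x,x,,
Class B,London,Jemma,x,x,,
'''

CSV2 = '''"Class","Location","Student","Scorecard","Number"
"Class A","York","Tom","0","653"
"Class B","Dumpt","Bob","23.1","299"
"Class B","Lond","Jemma","32.2","0"
'''


def usable(value):
    """True when a csv2 value is neither blank nor zero."""
    try:
        return float(value) != 0
    except ValueError:
        return False  # blank or non-numeric: leave csv1 untouched


def update_rows(csv1_text, csv2_text, prefix_len=4):
    """Return csv1 as text, updated from csv2, keeping csv1's row order."""
    theirs = {}
    for row in csv.DictReader(io.StringIO(csv2_text)):
        key = (row["Class"], row["Location"][:prefix_len], row["Student"])
        theirs[key] = row

    reader = csv.DictReader(io.StringIO(csv1_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        match = theirs.get((row["Class"], row["Local"][:prefix_len], row["Name"]))
        if match is not None:  # students missing from csv2 are written unchanged
            if usable(match["Scorecard"]):
                row["Score"] = match["Scorecard"]
            if usable(match["Number"]):
                row["No"] = match["Number"]
        writer.writerow(row)
    return out.getvalue()


result = update_rows(CSV1, CSV2)
print(result)
```

Reading from real files would only change the two `io.StringIO(...)` arguments to file objects opened with `newline=''`.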

Answer 1: (score: 5)

You can use fuzzywuzzy to match up the town names and append the result as a column to df2:

import pandas as pd

df1 = pd.read_csv(csv1)
df2 = pd.read_csv(csv2)

towns = df1.Local.unique()  # assuming this is complete list of towns

from fuzzywuzzy.fuzz import partial_ratio

In [11]: df2['Local'] =  df2.Location.apply(lambda short_location: max(towns, key=lambda t: partial_ratio(short_location, t)))

In [12]: df2
Out[12]: 
     Class Location Student  Scorecard  Number         Local
0  Class A     York     Jim        0.0     742          York
1  Class A     York     Sam        0.0     931          York
2  Class A     York     Tom        0.0     653          York
3  Class B    Dumpt     Bob       23.1     299  Dumpton Park
4  Class B    Dumpt    Bill       23.4     198  Dumpton Park
5  Class B    Dumpt   Sarah       23.5      12  Dumpton Park
6  Class A    Dover    Andy       23.0     983         Dover
7  Class A    Dover  Hannah        1.0     293         Dover
8  Class B     Lond   Jemma       32.2       0        London
9  Class B     Lond   James       32.0       0        London
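Because `max(towns, key=...)` always returns some town, even for an abbreviation that resembles none of them, a stricter alternative may be worth considering. The sketch below is not fuzzywuzzy: it swaps in plain prefix matching with `str.startswith`, assumes every abbreviation is a true prefix of exactly one town, and uses a hypothetical helper name `resolve_town`.

```python
def resolve_town(short_location, towns):
    """Return the single town that starts with short_location, else raise ValueError."""
    matches = [town for town in towns if town.startswith(short_location)]
    if len(matches) != 1:
        # Zero matches means an unknown abbreviation; two or more means it is ambiguous.
        raise ValueError("%r matches %d towns: %r" % (short_location, len(matches), matches))
    return matches[0]


towns = ["York", "Dumpton Park", "Dover", "London"]
full_names = [resolve_town(short, towns) for short in ["York", "Dumpt", "Dover", "Lond"]]
```

In pandas this could be applied the same way as the lambda above, for example `df2.Location.apply(lambda s: resolve_town(s, towns))`, at the cost of failing loudly on abbreviations that are not exact prefixes.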

Make the names consistent (at this point Student and Name are named differently):

In [13]: df2.rename(columns={'Student': 'Name'}, inplace=True)

Now you can merge (on the overlapping columns):

In [14]: res = df1.merge(df2, how='outer')

In [15]: res
Out[15]: 
     Class         Local    Name DPE JJK  Score  No Location  Scorecard  Number
0  Class A          York     Tom   x   x     32 NaN     York        0.0     653
1  Class A          York     Jim   x   x     10 NaN     York        0.0     742
2  Class A          York     Sam   x   x     32 NaN     York        0.0     931
3  Class B  Dumpton Park   Sarah   x   x    NaN NaN    Dumpt       23.5      12
4  Class B  Dumpton Park     Bob   x   x    NaN NaN    Dumpt       23.1     299
5  Class B  Dumpton Park    Bill   x   x    NaN NaN    Dumpt       23.4     198
6  Class A         Dover    Andy   x   x    NaN NaN    Dover       23.0     983
7  Class A         Dover  Hannah   x   x    NaN NaN    Dover        1.0     293
8  Class B        London   Jemma   x   x    NaN NaN     Lond       32.2       0
9  Class B        London   James   x   x    NaN NaN     Lond       32.0       0

What's left to clean up is the score; I think I would take the maximum of the two:

In [16]: res['Score'] = res.loc[:, ['Score', 'Scorecard']].max(1)

In [17]: del res['Scorecard'] 
         del res['No']
         del res['Location']

Then you are left with the columns you want:

In [18]: res
Out[18]: 
     Class         Local    Name DPE JJK  Score  Number
0  Class A          York     Tom   x   x   32.0     653
1  Class A          York     Jim   x   x   10.0     742
2  Class A          York     Sam   x   x   32.0     931
3  Class B  Dumpton Park   Sarah   x   x   23.5      12
4  Class B  Dumpton Park     Bob   x   x   23.1     299
5  Class B  Dumpton Park    Bill   x   x   23.4     198
6  Class A         Dover    Andy   x   x   23.0     983
7  Class A         Dover  Hannah   x   x    1.0     293
8  Class B        London   Jemma   x   x   32.2       0
9  Class B        London   James   x   x   32.0       0

In [18]: res.to_csv('foo.csv')

Note: to force the dtype to be object (with mixed dtypes, ints and floats, rather than all floats), you can use apply. If you are doing any analysis, I recommend against this!

res['Score'] = res['Score'].apply(lambda x: int(x) if int(x) == x else x, convert_dtype=False)

Answer 2: (score: 5)

Hopefully this code is a bit more readable. ;) A backport of Python's new Enum type is available (enum34 on PyPI).

from enum import Enum       # see PyPI for the backport (enum34)

class Field(Enum):

    course = 0
    location = 1
    student = 2
    dpe = 3
    jjk = 4
    score = -2
    number = -1

    def __index__(self):
        return self._value_

def Float(text):
    if not text:
        return 0.0
    return float(text)

def load_our_data(filename):
    "return a dict using the first three fields as the key"
    data = dict()
    with open(filename) as input:
        next(input)  # throw away header
        for line in input:
            fields = line.strip('\n').split(',')
            fields[Field.score] = Float(fields[Field.score])
            fields[Field.number] = Float(fields[Field.number])
            key = (
                fields[Field.course].lower(),
                fields[Field.location][:4].lower(),
                fields[Field.student].lower(),
                )
            data[key] = fields
    return data

def load_their_data(filename):
    "return a dict using the first three fields as the key"
    data = dict()
    with open(filename) as input:
        next(input)  # throw away header
        for line in input:
            fields = line.strip('\n').split(',')
            fields = [f.strip('"') for f in fields]
            fields[Field.score] = Float(fields[Field.score])
            fields[Field.number] = Float(fields[Field.number])
            key = (
                fields[Field.course].lower(),
                fields[Field.location][:4].lower(),
                fields[Field.student].lower(),
                )
            data[key] = fields
    return data

def merge_data(ours, theirs):
    "their data is only used if not blank and non-zero"
    for key, our_data in ours.items():
        their_data = theirs[key]
        if their_data[Field.score]:
            our_data[Field.score] = their_data[Field.score]
        if their_data[Field.number]:
            our_data[Field.number] = their_data[Field.number]

def write_our_data(data, filename):
    with open(filename, 'w') as output:
        for record in sorted(data.values()):
            line = ','.join([str(f) for f in record])
            output.write(line + '\n')

if __name__ == '__main__':
    ours = load_our_data('one.csv')
    theirs = load_their_data('two.csv')
    merge_data(ours, theirs)
    write_our_data(ours, 'three.csv')
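The reason `fields[Field.score]` works in the code above is the `__index__` method: list indexing accepts any object that implements it. A minimal sketch on Python 3, where `enum` is in the standard library; the two sample rows are taken from the question's data.

```python
from enum import Enum


class Field(Enum):
    course = 0
    location = 1
    student = 2
    score = -2    # negative positions count from the end of the row,
    number = -1   # so they fit both the 7-column and the 5-column layout

    def __index__(self):
        # Lets a Field member be used directly as a list index.
        return self._value_


our_row = ["Class A", "York", "Tom", "x", "x", "32", ""]
their_row = ["Class A", "York", "Tom", "0", "653"]

our_score = our_row[Field.score]
their_number = their_row[Field.number]
student = our_row[Field.student]
```

`enum.IntEnum` would give the same indexing behaviour without a custom `__index__`, at the cost of members also comparing equal to plain integers.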

Answer 3: (score: 4)

Python dictionaries are the way to go here:

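# <csv1>, <csv2> and <outFile> are placeholders for your file paths
studentDict = {}
order = []  # keys in csv1 order (plain dicts are unordered in Python 2.7)

with open(<csv1>, 'r') as f:
  header = next(f)  # keep csv1's header line for the output
  for line in f:
    LL = line.rstrip('\n').replace('"','').split(',')
    key = (LL[0], LL[1][:4], LL[2])  # first four letters, so 'Dumpt' matches 'Dumpton Park'
    studentDict[key] = LL
    order.append(key)

with open(<csv2>, 'r') as f:
  next(f)  # skip csv2's header
  for line in f:
    LL = line.rstrip('\n').replace('"','').split(',')
    key = (LL[0], LL[1][:4], LL[2])
    if key not in studentDict: continue  # no matching student in csv1
    if LL[-2] not in ('0', ''): studentDict[key][-2] = LL[-2]
    if LL[-1] not in ('0', ''): studentDict[key][-1] = LL[-1]

with open(<outFile>, 'w') as f:
  f.write(header)
  for key in order:
    f.write(','.join(studentDict[key]) + '\n')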

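Answer 4: (score: 4)

pandas makes this kind of task much more convenient.

EDIT: OK, since you can't rely on renaming the columns manually, Roman's suggestion to match on just the first few letters is a good one. We have to change a few things before that.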

In [62]: df1 = pd.read_clipboard(sep=',')

In [63]: df2 = pd.read_clipboard(sep=',')

In [68]: df1
Out[68]: 
     Class Location Student  Scorecard  Number
0  Class A     York     Jim        0.0     742
1  Class A     York     Sam        0.0     931
2  Class A     York     Tom        0.0     653
3  Class B    Dumpt     Bob       23.1     299
4  Class B    Dumpt    Bill       23.4     198
5  Class B    Dumpt   Sarah       23.5      12
6  Class A    Dover    Andy       23.0     983
7  Class A    Dover  Hannah        1.0     293
8  Class B     Lond   Jemma       32.2       0
9  Class B     Lond   James       32.0       0

In [69]: df2
Out[69]: 
     Class         Local    Name DPE JJK  Score   No
0  Class A          York     Tom   x   x   32.0  653
1  Class A          York     Jim   x   x   10.0  742
2  Class A          York     Sam   x   x   32.0  653
3  Class B  Dumpton Park   Sarah   x   x   23.5   12
4  Class B  Dumpton Park     Bob   x   x   23.1  299
5  Class B  Dumpton Park    Bill   x   x   23.4  198
6  Class A         Dover    Andy   x   x   23.0  983
7  Class A         Dover  Hannah   x   x    1.0  293
8  Class B        London   Jemma   x   x   32.2  NaN
9  Class B        London   James   x   x   32.0  NaN

Get the columns named the same.

In [70]: df1 = df1.rename(columns={'Location': 'Local', 'Student': 'Name', 'Scorecard': 'Score', 'Number': 'No'})

Now for the locations. Save the originals from df2 in a separate Series.

In [71]: locations = df2['Local']

In [72]: df1['Local'] = df1['Local'].str.slice(0, 4)

In [73]: df2['Local'] = df2['Local'].str.slice(0, 4)

Use the string methods to truncate to the first 4 characters (assuming this doesn't cause any false matches).

Now set the index:

In [78]: df1 = df1.set_index(['Class', 'Local', 'Name'])

In [79]: df2 = df2.set_index(['Class', 'Local', 'Name'])

In [80]: df1
Out[80]: 
                      Score   No
Class   Local Name              
Class A York  Jim       0.0  742
              Sam       0.0  931
              Tom       0.0  653
Class B Dump  Bob      23.1  299
              Bill     23.4  198
              Sarah    23.5   12
Class A Dove  Andy     23.0  983
              Hannah    1.0  293
Class B Lond  Jemma    32.2    0
              James    32.0    0

In [83]: df1 = df1.replace(0, np.nan)
In [84]: df2 = df2.replace(0, np.nan)

Finally, update the scores as before:

In [85]: df1.update(df2, overwrite=False)

You can get the original locations back with:

In [91]: df1 = df1.reset_index()
In [92]: df1['Local'] = locations

You can write the output to csv (and a bunch of other formats) with df1.to_csv('path/to/csv').

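Answer 5: (score: 2)

You could try using the csv module from the standard library. My solution is very similar to Chris H's, but I use the csv module to read and write the files. (In fact, I stole his technique of storing the keys in a list to preserve the order.)

If you use the csv module, you don't have to worry much about quotes, and it also lets you read rows directly into dictionaries with the column names as keys.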

import csv

# Open first CSV, and read each line as a dictionary with column names as keys.
with open('csv1.csv', 'rb') as csvfile1:
    table1 = csv.DictReader(csvfile1,['Class', 'Local', 'Name',
                            'DPE', 'JJK', 'Score', 'No'])
    table1.next() #skip header row
    first_table = {}
    original_order = [] #list keys to save original order
    # build dictionary of rows with name, location, and class as keys
    for row in table1:
        id = "%s from %s in %s" % (row['Name'], row['Local'][:4], row['Class'])
        first_table[id] = row
        original_order.append(id)

# Repeat for second csv, but don't worry about order
with open('csv2.csv', 'rb') as csvfile2:
    table2 = csv.DictReader(csvfile2, ['Class', 'Location',
                            'Student', 'Scorecard', 'Number'])
    table2.next()
    second_table = {}
    for row in table2:
        id = "%s from %s in %s" % (row['Student'], row['Location'][:4], row['Class'])
        second_table[id] = row

with open('student_data.csv', 'wb') as finalfile:
    results = csv.DictWriter(finalfile, ['Class', 'Local', 'Name',
                             'DPE', 'JJK', 'Score', 'No'])
    results.writeheader()
    # Replace data in first csv with data in second csv when conditions are satisfied.
    for student in original_order:
        if second_table[student]['Scorecard'] != "0" and second_table[student]['Scorecard'] != "":
            first_table[student]['Score'] = second_table[student]['Scorecard']
        if second_table[student]['Number'] != "0" and second_table[student]['Number'] != "":
            first_table[student]['No'] = second_table[student]['Number']
        results.writerow(first_table[student])
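To illustrate the point about quotes: stripping quotes and splitting on commas breaks as soon as a quoted field contains a comma, while the csv module parses it correctly. A small Python 3 sketch; the student name "Smith, Bob" is a hypothetical value invented to trigger the problem and does not appear in the question's data.

```python
import csv
import io

sample = (
    '"Class","Location","Student","Scorecard","Number"\n'
    '"Class B","Dumpt","Smith, Bob","23.1","299"\n'
)

# Naive approach: drop the quotes, then split on every comma.
naive_fields = sample.splitlines()[1].replace('"', '').split(',')

# csv.DictReader honours the quoting and uses the header row as keys.
rows = list(csv.DictReader(io.StringIO(sample)))
student = rows[0]["Student"]
```

Here the naive split yields six fields instead of five, so positional indexes such as `s[3]` would point at the wrong column.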

Hope this helps.