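Compare two Excel files in Python, keeping one column fixed as the key and matching it against the same column in the other file

Asked: 2017-06-12 11:39:28

Tags: python excel excel-2010 xlrd

We have two Excel files, one with 7.5k records and the other with 7k records. We need to compare the data by holding one specific column fixed in one worksheet and using it to match rows in the other worksheet.

Example sheet1: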

**Emp_ID|   Name|   Phone|  Address**
-------------------------------------
1       |     A |    123 |  ABC
-------------------------------------
2       |     B |    456 |  CBD
-------------------------------------
3       |     C |    789 |  S

Example sheet2:

**Emp_ID|   Name|   Phone|  Address**
-------------------------------------
1       |     A |    123 |  ABC
-------------------------------------
3       |     C |    789 |  S

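When the Python script is run with Emp_ID passed as the parameter, the comparison should be keyed on Emp_ID and report Emp_ID = 2 as missing. I am trying the xlrd module, but it only compares cell by cell; it does not hold one column fixed and then compare the matching row against the other Excel file.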

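import xlrd
from itertools import zip_longest

def compareexcel(oldSheet, newSheet):
    # Open both workbooks and take the first worksheet of each.
    rowb1 = xlrd.open_workbook(oldSheet)
    rowb2 = xlrd.open_workbook(newSheet)
    sheet1 = rowb1.sheet_by_index(0)
    sheet2 = rowb2.sheet_by_index(0)

    # Rows are paired by position, not by Emp_ID, so a single missing
    # row shifts every later row out of alignment.
    for rownum in range(max(sheet1.nrows, sheet2.nrows)):
        # Use an empty row when one sheet is shorter, instead of raising IndexError.
        row_rb1 = sheet1.row_values(rownum) if rownum < sheet1.nrows else []
        row_rb2 = sheet2.row_values(rownum) if rownum < sheet2.nrows else []

        for colnum, (c1, c2) in enumerate(zip_longest(row_rb1, row_rb2)):
            if c1 != c2:
                print("Row {} Col {} - {} != {}".format(rownum+1, colnum+1, c1, c2))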

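2 Answers:

Answer 0 (score: 1)

I wrote a function that looks up the key column's value in the other worksheet; the compare function then compares rows based on that lookup.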

import xlrd
from itertools import zip_longest

def search(sheet2, s):
    # Return the index of the first row whose first column equals s,
    # or None when no row matches.
    for row in range(sheet2.nrows):
        if s == sheet2.cell(row, 0).value:
            return row
    return None

def compare(oldPerPaxSheet, newPerPaxSheet):
    rb1 = xlrd.open_workbook(oldPerPaxSheet)
    rb2 = xlrd.open_workbook(newPerPaxSheet)
    sheet1 = rb1.sheet_by_index(0)
    sheet2 = rb2.sheet_by_index(0)

    for rownum in range(max(sheet1.nrows, sheet2.nrows)):
        if rownum < sheet1.nrows:
            row_rb1 = sheet1.row_values(rownum)
            print("row_rb1 : ", row_rb1)

            # The first column (Emp_ID) is the key used to find the matching row.
            search_str = sheet1.cell(rownum, 0).value

            r = search(sheet2, search_str)
            if r is not None:
                row_rb2 = sheet2.row_values(r)
                for colnum, (c1, c2) in enumerate(zip_longest(row_rb1, row_rb2)):
                    if c1 != c2:
                        print("Row {} Col {} - {} != {}".format(rownum+1, colnum+1, c1, c2))
            else:
                print("Row does not exist in the other sheet")
        else:
            print("Row {} missing".format(rownum+1))
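
With several thousand rows in each file, calling search once per row rescans the second sheet every time. One way to avoid that, sketched here under assumptions, is to index each sheet's rows by key once in a dictionary. The helper names `index_by_key` and `compare_rows` are hypothetical, and the rows are plain lists such as those returned by `sheet.row_values(i)`, so the sketch runs without xlrd or any file.

```python
from itertools import zip_longest

def index_by_key(rows, key_col=0):
    """Map each row's key (e.g. Emp_ID) to the row; a later duplicate key overwrites an earlier one."""
    return {row[key_col]: row for row in rows}

def compare_rows(old_rows, new_rows, key_col=0):
    """Return (missing_in_new, missing_in_old, diffs).

    diffs holds (key, column index, old value, new value) tuples
    for keys that appear in both inputs.
    """
    old_by_key = index_by_key(old_rows, key_col)
    new_by_key = index_by_key(new_rows, key_col)
    missing_in_new = sorted(set(old_by_key) - set(new_by_key))
    missing_in_old = sorted(set(new_by_key) - set(old_by_key))
    diffs = []
    for key, old_row in old_by_key.items():
        new_row = new_by_key.get(key)
        if new_row is None:
            continue  # already reported in missing_in_new
        for col, (c1, c2) in enumerate(zip_longest(old_row, new_row)):
            if c1 != c2:
                diffs.append((key, col, c1, c2))
    return missing_in_new, missing_in_old, diffs

# Sample rows taken from the question (header row omitted).
old_rows = [[1, "A", 123, "ABC"], [2, "B", 456, "CBD"], [3, "C", 789, "S"]]
new_rows = [[1, "A", 123, "ABC"], [3, "C", 789, "S"]]
missing_in_new, missing_in_old, diffs = compare_rows(old_rows, new_rows)
print(missing_in_new)  # [2]
```

With real workbooks, the row lists could be built as `[sheet.row_values(i) for i in range(1, sheet.nrows)]`, assuming row 0 holds the headers.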

Answer 1 (score: 0)

You can do this easily with pandas.read_excel.

I will use Emp_ID as the index.

Create the two DataFrames:
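import pandas as pd

# excel_filename, old_sheet and new_sheet are placeholders for your workbook path and sheet names.
# index_col=0 makes the first column (Emp_ID) the index; a list of sheets returns a dict of DataFrames.
sheets = pd.read_excel(excel_filename, sheet_name=[old_sheet, new_sheet], index_col=0)
sheet1 = sheets[old_sheet]
sheet2 = sheets[new_sheet]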
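
The question mentions two separate workbooks, whereas the snippet above reads two sheets from a single file. A minimal sketch for the two-file case follows, assuming the key column header is `Emp_ID` and the data sits on the first worksheet of each file. The `load_sheet` helper and the file names in the usage note are hypothetical.

```python
import pandas as pd

def load_sheet(path, key="Emp_ID"):
    """Read the first worksheet of an Excel file and index it by the key column.

    Assumes the first row of the worksheet contains the column headers.
    """
    frame = pd.read_excel(path, sheet_name=0)
    return frame.set_index(key)
```

Usage would look like `sheet1 = load_sheet("old.xlsx")` and `sheet2 = load_sheet("new.xlsx")`, after which the remaining steps apply unchanged.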

I added a few rows to make the differences clearer.

  

Sheet 1

       Name  Phone Address
Emp_ID
1         A    123     ABC
2         B    456     CBD
3         C    789       S
5         A    123     ABC
  

Sheet 2

       Name  Phone Address
Emp_ID
1         A    123     ABC
3         C    789       S
4         D     12       A
5         E    123     ABC
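
To reproduce the remaining steps without an Excel file, the two sample sheets can be built directly in memory. This is only a sketch: the `make_sheet` helper is hypothetical, and the values are copied from the two tables above.

```python
import pandas as pd

def make_sheet(rows):
    """Build a DataFrame indexed by Emp_ID from (Emp_ID, Name, Phone, Address) tuples."""
    columns = ["Emp_ID", "Name", "Phone", "Address"]
    return pd.DataFrame(rows, columns=columns).set_index("Emp_ID")

sheet1 = make_sheet([(1, "A", 123, "ABC"), (2, "B", 456, "CBD"),
                     (3, "C", 789, "S"), (5, "A", 123, "ABC")])
sheet2 = make_sheet([(1, "A", 123, "ABC"), (3, "C", 789, "S"),
                     (4, "D", 12, "A"), (5, "E", 123, "ABC")])

print(sheet1)
print(sheet2)
```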

Finding the missing Emp_IDs then becomes very simple:

missing_in_1  = set(sheet2.index) - set(sheet1.index) 
missing_in_2  = set(sheet1.index) - set(sheet2.index) 
  

missing_in_1, missing_in_2

({4}, {2})

So sheet1 lacks Emp_ID 4, which is present in sheet2, and sheet2 lacks Emp_ID 2, as expected.

Then, to look for differences, we do an inner join on the two sheets:

combined = pd.merge(sheet1, sheet2, left_index=True, right_index=True, suffixes=('_1', '_2'))
  

combined

       Name_1  Phone_1 Address_1 Name_2  Phone_2 Address_2
Emp_ID
1           A      123       ABC      A      123       ABC
3           C      789         S      C      789         S
5           A      123       ABC      E      123       ABC
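
The inner join keeps only the Emp_IDs present in both sheets. As a possible variation (not part of this answer's approach), an outer join with `indicator=True` adds a `_merge` column that marks each Emp_ID as `left_only`, `right_only`, or `both`, so the missing rows and the common rows come out of a single merge. The sample data is rebuilt inline so that the sketch runs on its own.

```python
import pandas as pd

cols = ["Emp_ID", "Name", "Phone", "Address"]
sheet1 = pd.DataFrame([(1, "A", 123, "ABC"), (2, "B", 456, "CBD"),
                       (3, "C", 789, "S"), (5, "A", 123, "ABC")],
                      columns=cols).set_index("Emp_ID")
sheet2 = pd.DataFrame([(1, "A", 123, "ABC"), (3, "C", 789, "S"),
                       (4, "D", 12, "A"), (5, "E", 123, "ABC")],
                      columns=cols).set_index("Emp_ID")

# Outer join on the Emp_ID index; _merge records which side each row came from.
outer = pd.merge(sheet1, sheet2, left_index=True, right_index=True,
                 how="outer", suffixes=("_1", "_2"), indicator=True)

only_in_1 = outer[outer["_merge"] == "left_only"].index.tolist()
only_in_2 = outer[outer["_merge"] == "right_only"].index.tolist()
in_both = outer[outer["_merge"] == "both"]

print(only_in_1, only_in_2)  # [2] [4]
```

Here `in_both` contains the same Emp_IDs as the inner join shown above, plus the extra `_merge` column.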

and iterate over the columns of sheet1 to find the differences, saving them in a dict:

differences = {}
for column in sheet1.columns:
    diff = combined[column+'_1'] != combined[column+'_2']
    if diff.any():
        differences[column] = list(combined[diff].index)
  

differences

{'Name': [5]}

If you want the full rows that differ rather than just their Emp_IDs, change the last line to differences[column] = combined[diff]

  

differences

{'Name':        
         Name_1  Phone_1 Address_1 Name_2  Phone_2 Address_2
 Emp_ID                                                    
 5           A      123       ABC      E      123       ABC}
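
The column loop can also be written as one element-wise comparison. The sketch below is an alternative, not this answer's method: it restricts both sheets to their shared Emp_IDs, builds a boolean mask of differing cells, and lists them as (Emp_ID, column) pairs. Note that `!=` treats two empty (NaN) cells as different, so blank cells may need a `fillna` first. The sample data is rebuilt inline so that the sketch runs on its own.

```python
import pandas as pd

cols = ["Emp_ID", "Name", "Phone", "Address"]
sheet1 = pd.DataFrame([(1, "A", 123, "ABC"), (2, "B", 456, "CBD"),
                       (3, "C", 789, "S"), (5, "A", 123, "ABC")],
                      columns=cols).set_index("Emp_ID")
sheet2 = pd.DataFrame([(1, "A", 123, "ABC"), (3, "C", 789, "S"),
                       (4, "D", 12, "A"), (5, "E", 123, "ABC")],
                      columns=cols).set_index("Emp_ID")

# Keep only the Emp_IDs present in both sheets, in the same order.
common = sheet1.index.intersection(sheet2.index)
left, right = sheet1.loc[common], sheet2.loc[common]

# True wherever a cell differs; both frames share index and columns.
mask = left != right

# Collect (Emp_ID, column) for every differing cell.
# int() is used because the sample Emp_IDs are integers.
changed_cells = [(int(emp_id), col)
                 for emp_id, row in mask.iterrows()
                 for col, differs in row.items() if differs]

print(changed_cells)  # [(5, 'Name')]
```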