Removing duplicate rows with a specific column combination from Excel using Python

Time: 2017-03-13 08:20:37

Tags: python xlrd

I have a Python program that reads an Excel document. I need to keep only the first occurrence of certain column combinations. For example:

    A     |  B
  -------------
  1.  200 | 201
  2.  200 | 202
  3.  200 | 201
  4.  200 | 203
  5.  201 | 201
  6.  201 | 202
  .............

I want to remove/skip the third row, which duplicates the first, and write the result to a CSV file. The function I have tried so far does not work.

4 Answers:

Answer 0 (score: 2)

mylist = [] is used twice, and assigning a single value to it makes things difficult. It should look like this:

mylist = []
for row in range(1, number_of_rows):  
    mylist.append((sheet.cell_value(row, 0), sheet.cell_value(row, 1)))

myset = set(mylist)

Note that a set is unordered. If you want to see the results in order, also check this
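One way to keep the first occurrence of each pair in its original order is sketched below. It assumes Python 3.7 or later, where dict keys are guaranteed to keep insertion order. The `pairs` list is hypothetical sample data standing in for the values collected with `sheet.cell_value`.

```python
# Hypothetical (A, B) pairs, standing in for the values read from the sheet.
pairs = [(200, 201), (200, 202), (200, 201), (200, 203), (201, 201), (201, 202)]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this keeps only the first occurrence of each pair.
unique_pairs = list(dict.fromkeys(pairs))

print(unique_pairs)
```

This only works because tuples are hashable; rows stored as lists would need to be converted to tuples first.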

Answer 1 (score: 2)

This worked for me in Python 2.7:

import xlrd

def validateExcel(filename):
    xls = xlrd.open_workbook(filename)
    for sheet in xls.sheets():
        number_of_rows = sheet.nrows
        mylist = []
        # Start at row 1, which skips row 0 (presumably the header).
        for row in range(1, number_of_rows):
            mylist.append((sheet.cell_value(row, 0), sheet.cell_value(row, 1)))
        # Deduplicate, then restore the order of first occurrence.
        myset = sorted(set(mylist), key=mylist.index)
        # Note: this returns inside the loop, so only the first sheet is processed.
        return myset

Answer 2 (score: 2)

Here is my solution. It removes the duplicates and creates a new file without them.

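As a rough illustration of that idea (not the answerer's own code), the sketch below drops rows whose key columns have already been seen and writes the rest with the standard `csv` module. The names `unique_rows` and `write_unique_csv`, the `key_columns` parameter, and the sample rows are all hypothetical. An in-memory buffer is used in place of a real output file.

```python
import csv
import io


def unique_rows(rows, key_columns=(0, 1)):
    """Yield only the rows whose key-column combination has not been seen yet."""
    seen = set()
    for row in rows:
        key = tuple(row[i] for i in key_columns)
        if key not in seen:
            seen.add(key)
            yield row


def write_unique_csv(rows, out_file, key_columns=(0, 1)):
    """Write the de-duplicated rows to an already opened file-like object."""
    writer = csv.writer(out_file)
    writer.writerows(unique_rows(rows, key_columns))


# Hypothetical sample rows matching the table in the question.
rows = [[200, 201], [200, 202], [200, 201], [200, 203], [201, 201], [201, 202]]

buffer = io.StringIO()  # stands in for open("output.csv", "w", newline="")
write_unique_csv(rows, buffer)
print(buffer.getvalue())
```

Tracking the seen keys in a set keeps each lookup cheap, and using only the key columns means rows may carry extra columns without affecting which ones count as duplicates.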

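Answer 3 (score: 1)

This should append each row (called a sublist in this example) to the mylist list if it has not been added already. That should give you a de-duplicated list in the order the rows were found in the xlsx file. If you can, it may be worth taking a look at the pandas library. If not, this should help: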

import xlrd

def validateExcel(filename):

    xls = xlrd.open_workbook(filename)

    for sheet in xls.sheets():

        number_of_rows = sheet.nrows
        number_of_columns = sheet.ncols

        # mylist is reset for every sheet, so the function returns the last sheet's rows.
        mylist = []

        for row in range(1, number_of_rows):
            sublist = [sheet.cell_value(row, col) for col in range(0, number_of_columns)]

            # Keep only the first occurrence of each row.
            if sublist not in mylist:
                mylist.append(sublist)

        print(mylist)

    return mylist

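Edit:

If you have an xlsx file with multiple sheets, you can use a dict to store the de-duplicated row data, with the sheet name as the key, and then pass that dict to a CSV-writing function: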

import xlrd

def validateExcel(filename):

    outputDict = {}

    xls = xlrd.open_workbook(filename)

    sheetCount = 0

    for sheet in xls.sheets():

        number_of_rows = sheet.nrows
        number_of_columns = sheet.ncols

        sheetname = sheet.name

        # Fall back to the sheet index when the sheet has no name.
        if not sheetname:
            sheetname = str(sheetCount)

        # Use the same key here as in the lookups below.
        outputDict[sheetname] = []

        for row in range(1, number_of_rows):
            sublist = [sheet.cell_value(row, col) for col in range(0, number_of_columns)]

            if sublist not in outputDict[sheetname]:
                outputDict[sheetname].append(sublist)

        print(outputDict[sheetname])

        sheetCount += 1

    return outputDict

import csv

# will go through the generated dictionary and write the data to csv files
def writeToFiles(generatedDictionary):

    for key in generatedDictionary:
        # Open for writing; newline="" is what the csv module expects on Python 3.
        with open(key + ".csv", "w", newline="") as csvFile:
            writer = csv.writer(csvFile)
            writer.writerows(generatedDictionary[key])

If you can use pandas, something like this could work:

import pandas as pd

# pd.ExcelFile exposes the sheet names; pd.read_excel by default returns a single DataFrame.
xls = pd.ExcelFile(filename)

for name in xls.sheet_names:

    sheetDataFrame = xls.parse(name)
    filtered = sheetDataFrame.drop_duplicates()

    filtered.to_csv(name + ".csv")
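Since the question is about duplicates in a specific column combination rather than whole rows, `drop_duplicates` can be restricted with its `subset` parameter. The sketch below uses a hypothetical in-memory DataFrame with columns named `A`, `B` and `note`. Real column names would come from the header row of the sheet.

```python
import pandas as pd

# Hypothetical data: the same A/B values as in the question, plus an extra column.
df = pd.DataFrame({
    "A": [200, 200, 200, 200, 201, 201],
    "B": [201, 202, 201, 203, 201, 202],
    "note": ["first", "x", "repeat", "y", "z", "w"],
})

# Compare only columns A and B, and keep the first row of each combination.
deduped = df.drop_duplicates(subset=["A", "B"], keep="first")

print(deduped)
```

The third row is dropped even though its `note` value differs, because only `A` and `B` are compared.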