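Finding duplicates in each row and each column

Asked: 2016-03-10 17:25:36

Tags: python python-3.x

The function needs to check a file for duplicates in each row and each column.

Example of a file containing duplicates: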

A B C
A A B
B C A

As you can see, row 2 contains two copies of A, and column 1 also contains two A's. The code:

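def duplication_char(dc):
    # Read the file and split every line into its whitespace-separated fields.
    with open(dc, "r") as duplicatechars:
        linecheck = duplicatechars.readlines()
    linecheck = [line.split() for line in linecheck]

    # A row has duplicates if its set of values is shorter than the row.
    for row in linecheck:
        if len(set(row)) != len(row):
            print("duplicates", " ".join(row))

    # zip(*linecheck) transposes the rows, so each item is a column.
    for column in zip(*linecheck):
        if len(set(column)) != len(column):
            print("duplicates", " ".join(column))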

2 answers:

Answer 0 (score: 4)

Well, this is how I would do it.

First, read your file and build a 2D numpy array from its contents:

import numpy
with open('test.txt', 'r') as fil:
    lines = fil.readlines()
lines = [line.strip().split() for line in lines]
arr = numpy.array(lines)

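Then check each row for duplicates using a set. A set contains no duplicates, so if the length of the set differs from the length of the row, the row contains duplicates: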

for row in arr:
    if len(set(row)) != len(row):
        print('Duplicates in row: ', row)
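
As a minimal sketch of why the set comparison works: converting a sequence to a set drops repeated values, so a shorter set means at least one value occurred more than once. The helper name `has_duplicates` is made up for illustration and is not part of the code above.

```python
def has_duplicates(values):
    """Return True if any value occurs more than once in `values`."""
    # A set keeps one copy of each distinct value, so it is shorter
    # than the input exactly when something was repeated.
    return len(set(values)) != len(values)


print(has_duplicates(["A", "B", "C"]))  # False
print(has_duplicates(["A", "A", "B"]))  # True
```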

Then transpose the numpy array and apply the same set check to each column:

for col in arr.T:
    if len(set(col)) != len(col):
        print('Duplicates in column: ', col)

If you wrap all of this in a function:

import numpy

def check_for_duplicates(filename):
    with open(filename, 'r') as fil:
        lines = fil.readlines()
    lines = [line.strip().split() for line in lines]
    arr = numpy.array(lines)

    for row in arr:
        if len(set(row)) != len(row):
            print('Duplicates in row: ', row)

    # arr.T is the transposed array, so iterating over it yields columns.
    for col in arr.T:
        if len(set(col)) != len(col):
            print('Duplicates in column: ', col)

As Apero suggested, you can also do this with zip (https://docs.python.org/3/library/functions.html#zip) instead of numpy:

def check_for_duplicates(filename):
    with open(filename, 'r') as fil:
        lines = fil.readlines()
    lines = [line.strip().split() for line in lines]

    for row in lines:
        if len(set(row)) != len(row):
            print('Duplicates in row: ', row)

    # zip(*lines) transposes the list of rows into tuples of column values.
    for col in zip(*lines):
        if len(set(col)) != len(col):
            print('Duplicates in column: ', col)

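For your example, the numpy version prints the following (the zip version reports the same row and column, formatted as a list and a tuple):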

# Duplicates in row:  ['A' 'A' 'B']
# Duplicates in column:  ['A' 'A' 'B']
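
If you would rather collect the results than print them, a variant along these lines may be easier to test. This is only a sketch: the name `find_duplicate_lines` and the `(kind, index, values)` result format are my own assumptions, indices are 1-based to match the "row 2" / "column 1" wording of the question, and the function takes an iterable of text lines rather than a filename so it can be tried without a file. Note that `zip` stops at the shortest row, so this assumes every row has the same number of fields.

```python
def find_duplicate_lines(lines):
    """Return a list of (kind, index, values) for rows/columns with repeats."""
    # Split each non-blank line into whitespace-separated fields.
    rows = [line.split() for line in lines if line.strip()]
    found = []

    # Rows: compare the number of distinct values with the row length.
    for i, row in enumerate(rows, start=1):
        if len(set(row)) != len(row):
            found.append(("row", i, row))

    # Columns: zip(*rows) transposes the rows into tuples of column values.
    for j, col in enumerate(zip(*rows), start=1):
        if len(set(col)) != len(col):
            found.append(("column", j, list(col)))

    return found


sample = ["A B C", "A A B", "B C A"]
for kind, index, values in find_duplicate_lines(sample):
    print(kind, index, " ".join(values))
# row 2 A A B
# column 1 A A B
```

An open file object iterates over its lines, so it can be passed in directly in place of `sample`.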

Answer 1 (score: 1)

You can keep the data as a list of lists and transpose it with zip.

As an example, try:

from collections import Counter

# fn is the path to your data file
with open(fn) as fin:
    data = [line.split() for line in fin]

rowdups={}  
coldups={}
for d, m in ((rowdups, data), (coldups, zip(*data))):   
    for i, sl in enumerate(m):
        count=Counter(sl)
        for c in count.most_common():
            if c[1]>1:
                d.setdefault(i, []).append(c)

>>> rowdups 
{1: [('A', 2)]}
>>> coldups 
{0: [('A', 2)]}
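
The same idea can be packaged as a function that works on rows already held in memory, which makes it easy to try without a file. This is a sketch under a couple of assumptions: `count_duplicates` is a hypothetical name, and the input is already a list of lists. Indices are 0-based, as in the output above.

```python
from collections import Counter


def count_duplicates(data):
    """Return (rowdups, coldups), each mapping index -> [(value, count), ...]."""
    rowdups, coldups = {}, {}
    # Run the same counting pass over the rows, then over the transposed rows.
    for result, lines in ((rowdups, data), (coldups, zip(*data))):
        for i, line in enumerate(lines):
            # Keep only the values that occur more than once in this line.
            repeated = [(v, n) for v, n in Counter(line).most_common() if n > 1]
            if repeated:
                result[i] = repeated
    return rowdups, coldups


data = [["A", "B", "C"], ["A", "A", "B"], ["B", "C", "A"]]
rowdups, coldups = count_duplicates(data)
print(rowdups)  # {1: [('A', 2)]}
print(coldups)  # {0: [('A', 2)]}
```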