The function needs to check a file for duplicates in every row and every column.
Example of a file that contains duplicates:
A B C
A A B
B C A
As you can see, row 2 contains two copies of A, and column 1 also contains two A's. Code:
def duplication_char(dc):
    with open(dc, "r") as duplicatechars:
        # the file handle is named duplicatechars, so use that name here
        linecheck = duplicatechars.readlines()
    linecheck = [line.split() for line in linecheck]
    for row in linecheck:
        if len(set(row)) != len(row):
            print("duplicates", " ".join(row))
    # zip(*linecheck) transposes the rows into columns
    for column in zip(*linecheck):
        if len(set(column)) != len(column):
            print("duplicates", " ".join(column))
Answer 0 (score: 4)
Well, here is how I would do it.
First, read your file and build a 2D numpy array from its contents:
import numpy
with open('test.txt', 'r') as fil:
    lines = fil.readlines()
lines = [line.strip().split() for line in lines]
arr = numpy.array(lines)
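Then use a set to check each row for duplicates (a set holds no duplicates, so if the length of the set differs from the length of the row, the row contains duplicates):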
for row in arr:
    if len(set(row)) != len(row):
        print('Duplicates in row:', row)
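As a quick standalone illustration of the set-length check on plain lists, here is a minimal sketch. The helper name `has_duplicates` is hypothetical and not part of the answer; note that the check only says whether something repeats, not which value.

```python
def has_duplicates(items):
    """Return True if any value occurs more than once in items."""
    # A set keeps one copy of each value, so a shorter set means repeats.
    return len(set(items)) != len(items)


print(has_duplicates(["A", "B", "C"]))  # False
print(has_duplicates(["A", "A", "B"]))  # True
```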
Then check each column for duplicates the same way, by transposing your numpy array:
for col in arr.T:
    if len(set(col)) != len(col):
        print('Duplicates in column:', col)
If you wrap all of this in a function:
import numpy

def check_for_duplicates(filename):
    with open(filename, 'r') as fil:
        lines = fil.readlines()
    lines = [line.strip().split() for line in lines]
    arr = numpy.array(lines)
    # rows: compare each row against its set
    for row in arr:
        if len(set(row)) != len(row):
            print('Duplicates in row:', row)
    # columns: the transpose turns columns into rows
    for col in arr.T:
        if len(set(col)) != len(col):
            print('Duplicates in column:', col)
Following Apero's suggestion, you can also do this with zip (https://docs.python.org/3/library/functions.html#zip) instead of numpy:
def check_for_duplicates(filename):
    with open(filename, 'r') as fil:
        lines = fil.readlines()
    lines = [line.strip().split() for line in lines]
    for row in lines:
        if len(set(row)) != len(row):
            print('Duplicates in row:', row)
    # zip(*lines) yields the columns as tuples
    for col in zip(*lines):
        if len(set(col)) != len(col):
            print('Duplicates in column:', col)
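For your example, the numpy version prints the following (the zip version reports the same values, formatted as a list and a tuple):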
# Duplicates in row: ['A' 'A' 'B']
# Duplicates in column: ['A' 'A' 'B']
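If you would rather get the positions back than print them, the same set check can be sketched as below. This is only an illustration under a few assumptions: the function name `find_duplicate_lines` is hypothetical, it takes already-read lines instead of a filename, it reports 1-based indices, and it uses `itertools.zip_longest` so that rows of unequal length do not cause columns to be dropped (plain `zip` stops at the shortest row).

```python
from itertools import zip_longest


def find_duplicate_lines(lines):
    """Return (rows, cols): 1-based indices of rows/columns with repeats."""
    # Skip blank lines and split each remaining line on whitespace.
    grid = [line.split() for line in lines if line.strip()]
    rows = [i for i, row in enumerate(grid, start=1)
            if len(set(row)) != len(row)]
    cols = []
    # zip_longest pads short rows with None; the padding is ignored below.
    for j, col in enumerate(zip_longest(*grid), start=1):
        cells = [c for c in col if c is not None]
        if len(set(cells)) != len(cells):
            cols.append(j)
    return rows, cols


sample = ["A B C", "A A B", "B C A"]
print(find_duplicate_lines(sample))  # ([2], [1])
```

To run it on a file, something like `find_duplicate_lines(open(path))` should work, since a file object iterates over its lines.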
Answer 1 (score: 1)
You can keep a list of lists and transpose it with zip.
For example, try:
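from collections import Counter
# fn is the path to your input file
with open(fn) as fin:
    data = [line.split() for line in fin]
rowdups = {}
coldups = {}
# first pass fills rowdups from the rows, second fills coldups from the columns
for d, m in ((rowdups, data), (coldups, zip(*data))):
    for i, sl in enumerate(m):
        count = Counter(sl)
        for c in count.most_common():
            # c is a (value, count) pair
            if c[1] > 1:
                d.setdefault(i, []).append(c)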
>>> rowdups
{1: [('A', 2)]}
>>> coldups
{0: [('A', 2)]}
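The same Counter idea can be folded into a small reusable function that works on in-memory data. This is a sketch, not the answer's own code: the name `duplicate_counts` is hypothetical, and it returns a dict of value-to-count mappings per index (0-based, as above) instead of lists of tuples.

```python
from collections import Counter


def duplicate_counts(lines):
    """Map each 0-based index to the values that repeat there, with counts."""
    result = {}
    for i, line in enumerate(lines):
        # Keep only the values that occur more than once in this line.
        repeated = {value: n for value, n in Counter(line).items() if n > 1}
        if repeated:
            result[i] = repeated
    return result


data = [["A", "B", "C"], ["A", "A", "B"], ["B", "C", "A"]]
print(duplicate_counts(data))        # {1: {'A': 2}}
print(duplicate_counts(zip(*data)))  # {0: {'A': 2}}
```

Passing `zip(*data)` reuses the function unchanged for columns, because the transpose turns each column into one tuple.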