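I want to walk through a CSV file line by line and compare the first field of line 1 with the first field of the next line, and so on. If a match is found, I want to discard both lines that share that field and keep only the lines that have no match.
Here is a sample data set (no_dup.txt):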
Ac_Gene_ID M_Gene_ID
ENSGMOG00000015632 ENSORLG00000010573
ENSGMOG00000015632 ENSORLG00000010585
ENSGMOG00000003747 ENSORLG00000006947
ENSGMOG00000003748 ENSORLG00000004636
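Basically, I want to exclude lines 1 and 2, because they contain the same first field (ENSGMOG00000015632), and keep lines 3 and 4.
Below is the code I tried but could not finish: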
prev = None
with open("no_dup.txt", 'r') as fh_in:
    for line in fh_in:
        line = line.strip()
        if line.startswith("E"):
            line1 = line.split()
            print("initial gene = " + line1[0])
            if prev is not None or prev != line1[0]:
                prev = line1[0]
Answer 0 (score: 1)
I think the clean way is to build a map from each entry to the list of lines that start with it.
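entries = {}
with open('no_dup.txt', 'r') as fh_in:
    for line in fh_in:
        # Group every line under its first whitespace-separated field
        entry = line.split()[0]
        if entry in entries:
            entries[entry].append(line)
        else:
            entries[entry] = [line]

# Keep only the entries that occurred on exactly one line
for matches in entries.values():
    if len(matches) == 1:
        print(matches[0].rstrip())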
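Note that this does not preserve the order of the entries on Python versions older than 3.7, where a plain dict is unordered; collections.OrderedDict avoids that.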
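If the original order matters, one option is a two-pass variant: count the first fields, then filter the lines in input order. This is only a minimal sketch; the function name unique_first_field and the in-memory sample list are hypothetical and stand in for reading no_dup.txt. Unlike a neighbour-to-neighbour comparison, it also drops duplicates that are not adjacent.

```python
from collections import Counter


def unique_first_field(lines):
    """Return the lines whose first field occurs exactly once, in input order.

    Hypothetical helper for illustration; blank lines are skipped.
    """
    rows = [line for line in lines if line.strip()]
    # First pass: count how often each first field appears
    counts = Counter(row.split()[0] for row in rows)
    # Second pass: keep rows whose first field was seen only once
    return [row for row in rows if counts[row.split()[0]] == 1]


# Sample data copied from the question (stands in for the file contents)
sample = [
    "Ac_Gene_ID M_Gene_ID\n",
    "ENSGMOG00000015632 ENSORLG00000010573\n",
    "ENSGMOG00000015632 ENSORLG00000010585\n",
    "ENSGMOG00000003747 ENSORLG00000006947\n",
    "ENSGMOG00000003748 ENSORLG00000004636\n",
]

for row in unique_first_field(sample):
    print(row.rstrip())
```

With a real file, passing the open file handle instead of sample should work the same way, since the function only iterates over its argument once to build rows.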
Answer 1 (score: 0)
Your start looks good:
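def filter_dups(iterable):
    prev = None       # pending data line of the current run
    prev_key = None   # first field of that line
    dup = False       # True once the current key has repeated
    for line in iterable:
        if line.startswith("E"):
            key = line.split(None, 1)[0]
            if prev is not None and key == prev_key:
                # Same first field as the previous line: drop the whole run
                dup = True
            else:
                if prev is not None and not dup:
                    yield prev
                prev, prev_key, dup = line, key, False
        else:
            # Header or other non-data line: flush the pending line, then pass it through
            if prev is not None and not dup:
                yield prev
            prev = None
            yield line
    if prev is not None and not dup:
        yield prev

# The output file must be opened for writing
with open("no_dup.txt", 'r') as fh_in:
    with open("no_dup_out.txt", 'w') as fh_out:
        fh_out.writelines(filter_dups(fh_in))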
Answer 2 (score: 0)
You can use:
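with open('a.txt', 'r') as inputFile:
    lines = inputFile.readlines()

# A line is printed only when the next line starts with a different first field,
# so a run of duplicates is reduced to its last line rather than removed entirely.
prev = lines[0]
for i in range(1, len(lines)):
    cur = lines[i]
    if prev.split()[0] != cur.split()[0]:
        print(prev.strip())
    prev = cur
print(lines[-1].strip())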
Input:
ENSGMOG00000015632 ENSORLG00000010573
ENSGMOG00000015632 ENSORLG00000010585
ENSGMOG00000003747 ENSORLG00000006947
ENSGMOG00000003748 ENSORLG00000004636
Output:
ENSGMOG00000015632 ENSORLG00000010585
ENSGMOG00000003747 ENSORLG00000006947
ENSGMOG00000003748 ENSORLG00000004636
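The output above still contains one ENSGMOG00000015632 line, because the last line of a run of duplicates is kept. If the whole run should go, as the question asks, a possible sketch groups consecutive lines with itertools.groupby and keeps only runs of length one. The names first_field and drop_repeated_runs are hypothetical, and the rows list stands in for the contents of a.txt.

```python
from itertools import groupby


def first_field(line):
    """Return the first whitespace-separated field, or '' for a blank line."""
    parts = line.split()
    return parts[0] if parts else ""


def drop_repeated_runs(lines):
    """Yield only lines whose first field differs from both neighbours.

    Hypothetical helper: groupby collects consecutive lines sharing a key,
    and any run longer than one line is discarded as a whole.
    """
    for _, run in groupby(lines, key=first_field):
        run = list(run)
        if len(run) == 1:
            yield run[0]


# Same input as shown above (stands in for the file contents)
rows = [
    "ENSGMOG00000015632 ENSORLG00000010573",
    "ENSGMOG00000015632 ENSORLG00000010585",
    "ENSGMOG00000003747 ENSORLG00000006947",
    "ENSGMOG00000003748 ENSORLG00000004636",
]

kept = list(drop_repeated_runs(rows))
for row in kept:
    print(row)
```

Because groupby only compares neighbours, this assumes identical first fields sit on adjacent lines, which matches the comparison described in the question.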