I have a huge input file, for example:
con1 P1 140 602
con1 P2 140 602
con2 P5 642 732
con3 P8 17 348
con3 P9 17 348
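I want to iterate over each con, remove the lines whose values in columns [2] and [3] are repeated, and print the result to a new .txt file, so that my output file looks like this (note: my second column may be different for every line within a con):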
con1 P1 140 602
con2 P5 642 732
con3 P8 17 348
The script I tried (I don't know how to finish it):
from collections import defaultdict
start = defaultdict(int)
end = defaultdict(int)
o1=open('result1.txt','w')
o2=open('result2.txt','w')
with open('example.txt') as f:
    for line in f:
        line = line.split()
        start[line[2]]
        end[line[3]]
        if start.keys() == 1 and end.keys() ==1:
            o1.writelines(line)
        else:
            o2.write(line)
Update: additional example
con20 EMT20540 951 1580
con20 EMT14935 975 1655
con20 EMT24081 975 1655
con20 EMT19916 975 1652
con20 EMT23831 975 1655
con20 EMT19915 975 1652
con20 EMT09010 975 1649
con20 EMT29525 975 1655
con20 EMT19914 975 1652
con20 EMT19913 975 1652
con20 EMT23832 975 1652
con20 EMT09009 975 1637
con20 EMT16812 975 1649
Expected output:
con20 EMT20540 951 1580
con20 EMT14935 975 1655
con20 EMT19916 975 1652
con20 EMT09010 975 1649
con20 EMT09009 975 1637
Answer 0 (score: 2)
You can use itertools.groupby here:
from itertools import groupby
with open('input.txt') as f1, open('f_out', 'w') as f2:
    # Firstly group the data by the first column
    for k, g in groupby(f1, key=lambda x: x.split()[0]):
        # Now during the iteration over each group, we need to store only
        # those lines that have unique 3rd and 4th column. For that we can
        # use a `set()`, we store all the seen columns in the set as tuples and
        # ignore the repeated columns.
        seen = set()
        for line in g:
            columns = tuple(line.rsplit(None, 2)[-2:])
            if columns not in seen:
                # The 3rd and 4th column were unique here, so
                # store this as seen column and also write it to the file.
                seen.add(columns)
                f2.write(line.rstrip() + '\n')
                print(line.rstrip())
Output:
con20 EMT20540 951 1580
con20 EMT14935 975 1655
con20 EMT19916 975 1652
con20 EMT09010 975 1649
con20 EMT09009 975 1637
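For reference, the same idea can be wrapped in a small function that runs on Python 3. This is only a sketch: the name `dedupe_lines` and the in-memory list of lines are assumptions made for illustration, and it relies on all lines of one con being adjacent, because `groupby` only groups consecutive items.

```python
from itertools import groupby


def dedupe_lines(lines):
    """Keep the first line for each (3rd, 4th) column pair within each con.

    `dedupe_lines` is a hypothetical helper name used only for this sketch.
    """
    result = []
    # groupby only merges *consecutive* lines that share the first column.
    for _, group in groupby(lines, key=lambda line: line.split()[0]):
        seen = set()
        for line in group:
            key = tuple(line.split()[-2:])  # (3rd column, 4th column)
            if key not in seen:
                seen.add(key)
                result.append(line.rstrip())
    return result


# Sample data copied from the question.
sample = [
    "con1 P1 140 602",
    "con1 P2 140 602",
    "con2 P5 642 732",
    "con3 P8 17 348",
    "con3 P9 17 348",
]
for line in dedupe_lines(sample):
    print(line)
```

If the input is not already sorted by con, sorting it by the first column before grouping would be needed for this approach to see each con as a single group.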
Answer 1 (score: 1)
Code:
with open('example.txt', 'r') as f:
    array = []
    for line in f:
        array.append(line.rstrip().split())

def func(array, j):
    offset = []
    if j < len(array):
        # Compare row j-1 against every row after it
        firstRow = array[j-1]
        for i in range(j, len(array)):
            if (firstRow[3] == array[i][3] and firstRow[2] == array[i][2]
                    and firstRow[0] == array[i][0]):
                offset.append(i)
        for item in offset[::-1]:  # Q. Why offset[::-1] and not offset?
            del array[item]
        return func(array, j=j+1)

func(array, 1)
for e in array:
    print('%s\t\t%s\t\t%s\t%s' % (e[0], e[1], e[2], e[3]))
Output:
con20 EMT20540 951 1580
con20 EMT14935 975 1655
con20 EMT19916 975 1652
con20 EMT09010 975 1649
con20 EMT09009 975 1637
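On the question posed in the code comment (why `offset[::-1]` and not `offset`): deleting an element shifts every later element one position to the left, so indices collected beforehand stay valid only if the highest one is deleted first. A minimal illustration, with made-up list contents:

```python
items = ["a", "b", "c", "d", "e"]
to_delete = [1, 3]  # indices collected in ascending order ("b" and "d")

# Deleting from the highest index down leaves the lower indices untouched.
reverse_order = list(items)
for i in to_delete[::-1]:
    del reverse_order[i]
print(reverse_order)  # ['a', 'c', 'e']

# Deleting in ascending order removes the wrong element the second time,
# because "d" has already moved from index 3 to index 2.
forward_order = list(items)
for i in to_delete:
    del forward_order[i]
print(forward_order)  # ['a', 'c', 'd']
```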
Answer 2 (score: -1)
You can do it like this:
my_list = list(set(open(file_name, 'r')))
Then write it to your other file.
>>> a = [1,2,3,4,3,2,3,2]
>>> my_list = list(set(a))
>>> print(my_list)
[1, 2, 3, 4]
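One caveat worth checking against the sample data from the question: a `set` of whole lines only drops lines that are identical in every column, and it does not keep the original order. Since the second column differs between the rows in question, they are not treated as duplicates. A quick check, where the list literal stands in for the file contents:

```python
lines = [
    "con1 P1 140 602",
    "con1 P2 140 602",
    "con2 P5 642 732",
]

unique_lines = set(lines)
# All three lines survive, because "P1" and "P2" make the first two differ.
print(len(unique_lines))  # 3

# Comparing only columns 1, 3 and 4 does treat the first two as duplicates.
keys = {(l.split()[0], l.split()[2], l.split()[3]) for l in lines}
print(len(keys))  # 2
```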