Question

I have a huge input file, for example:

con1    P1  140 602
con1    P2  140 602
con2    P5  642 732
con3    P8  17  348
con3    P9  17  348

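I want to iterate within each con, drop the rows whose line[2] and line[3] values (the third and fourth columns) repeat an earlier row, and print the result to a new .txt file, so that my output file looks like this. (Note: my second column may be different for every row within a con.)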

con1    P1  140 602
con2    P5  642 732
con3    P8  17  348

The script I tried (I don't know how to finish it):

from collections import defaultdict
start = defaultdict(int)
end = defaultdict(int)
o1=open('result1.txt','w')
o2=open('result2.txt','w')
with open('example.txt') as f:
    for line in f:
        line = line.split()
        start[line[2]]
        end[line[3]]
        if start.keys() == 1 and end.keys() ==1:
            o1.writelines(line)
        else:
            o2.write(line)

Update: an additional example

con20   EMT20540    951 1580
con20   EMT14935    975 1655
con20   EMT24081    975 1655
con20   EMT19916    975 1652
con20   EMT23831    975 1655
con20   EMT19915    975 1652
con20   EMT09010    975 1649
con20   EMT29525    975 1655
con20   EMT19914    975 1652
con20   EMT19913    975 1652
con20   EMT23832    975 1652
con20   EMT09009    975 1637
con20   EMT16812    975 1649

Expected output:

con20   EMT20540    951 1580
con20   EMT14935    975 1655
con20   EMT19916    975 1652
con20   EMT09010    975 1649
con20   EMT09009    975 1637

Answer 1

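You can use itertools.groupby here. Note that groupby only groups consecutive lines, so this assumes all rows of a con are adjacent in the file, as in the samples: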

from itertools import groupby

with open('input.txt') as f1, open('f_out', 'w') as f2:
    #Firstly group the data by the first column
    for k, g in groupby(f1, key=lambda x:x.split()[0]):
        # Now during the iteration over each group, we need to store only
        # those lines that have unique 3rd and 4th column. For that we can
        # use a `set()`, we store all the seen columns in the set as tuples and
        # ignore the repeated columns.   

        seen = set()
        for line in g:
            columns = tuple(line.rsplit(None, 2)[-2:])
            if columns not in seen:
                #The 3rd and 4th column were unique here, so
                # store this as seen column and also write it to the file.
                seen.add(columns)
                f2.write(line.rstrip() + '\n') 
                print(line.rstrip())

Output:

con20   EMT20540    951 1580
con20   EMT14935    975 1655
con20   EMT19916    975 1652
con20   EMT09010    975 1649
con20   EMT09009    975 1637
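If the rows of a con are not guaranteed to be adjacent, the same idea can be sketched without groupby by putting the con name into the key that is remembered. This is only a minimal sketch: the helper name `unique_rows` and the in-memory `sample` list are made up for illustration, and it assumes every data line has at least four whitespace-separated columns.

```python
def unique_rows(lines):
    """Yield the first line seen for each (con, start, end) combination."""
    seen = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        # Columns 1, 3 and 4 identify a row; column 2 is ignored on purpose.
        key = (fields[0], fields[2], fields[3])
        if key not in seen:
            seen.add(key)
            yield line.rstrip("\n")


# Hypothetical in-memory sample taken from the question's update.
sample = [
    "con20   EMT20540    951 1580\n",
    "con20   EMT14935    975 1655\n",
    "con20   EMT24081    975 1655\n",
    "con20   EMT19916    975 1652\n",
    "con20   EMT23831    975 1655\n",
]

result = list(unique_rows(sample))
for row in result:
    print(row)
```

Because `unique_rows` is a generator, it can also be fed an open file object directly, so a huge input is processed line by line. Only the set of keys is held in memory.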

Answer 2

My version:

f = open('example.txt','r').readlines()
array = []

for line in f:
  array.append(line.rstrip().split())


def func(array, j):
  offset = []
  if j < len(array):
    firstRow = array[j-1]
    for i in range(j, len(array)):
      if (firstRow[3] == array[i][3] and firstRow[2] == array[i][2]
        and firstRow[0] == array[i][0]):
        offset.append(i)

    for item in offset[::-1]:# Q. Why offset[::-1] and not offset?
      del array[item]

    return func(array, j=j+1)

func(array, 1)

for e in array:
  print('%s\t\t%s\t\t%s\t%s' % (e[0], e[1], e[2], e[3]))

It prints:

con20   EMT20540    951 1580
con20   EMT14935    975 1655
con20   EMT19916    975 1652
con20   EMT09010    975 1649
con20   EMT09009    975 1637
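The comment in the code asks why the indices are walked as offset[::-1]. A small experiment suggests the reason: deleting an element shifts everything after it one place to the left, so indices collected beforehand stay valid only if the largest ones are deleted first. The helper `delete_indices` below is hypothetical and exists only to compare the two orders.

```python
def delete_indices(items, indices, reverse):
    """Delete the given positions from a copy of items, in the chosen order."""
    result = list(items)
    for i in sorted(indices, reverse=reverse):
        if i < len(result):  # guard: shifted indices can fall off the end
            del result[i]
    return result


rows = ["a", "b", "c", "d", "e"]
to_delete = [1, 3]  # we intend to remove "b" and "d"

# Highest index first: earlier positions are not disturbed.
correct = delete_indices(rows, to_delete, reverse=True)

# Lowest index first: after removing "b", index 3 now points at "e".
wrong = delete_indices(rows, to_delete, reverse=False)

print(correct)
print(wrong)
```

In the forward order the second deletion removes "e" instead of "d", which is the kind of silent mistake the reversed iteration avoids.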

Answer 3

You can do it like this:

my_list = list(set(open(file_name, 'r')))

Then write it to your other file.

Simple example:

>>> a = [1,2,3,4,3,2,3,2]
>>> my_list = list(set(a))

>>> print(my_list)
[1, 2, 3, 4]
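One caveat worth checking before relying on set() over whole lines: it only merges lines that are identical character for character, and it does not keep the original order. In the question the second column differs between otherwise duplicate rows, so such rows would all survive. The sketch below uses the question's first sample as a hypothetical in-memory list and, as an alternative, keys a dict on columns 1, 3 and 4. It assumes Python 3.7+, where dicts preserve insertion order, and exactly four columns per line.

```python
lines = [
    "con1    P1  140 602",
    "con1    P2  140 602",
    "con2    P5  642 732",
]

# set() compares whole lines, so P1 and P2 count as different entries.
whole_line_unique = set(lines)

# Keep the first line for each (con, start, end); the name column is ignored.
first_by_key = {}
for line in lines:
    con, _name, start, end = line.split()
    first_by_key.setdefault((con, start, end), line)

deduped = list(first_by_key.values())

print(len(whole_line_unique))
for row in deduped:
    print(row)
```

Here the set still holds all three lines, while the keyed dict keeps only the con1 P1 and con2 P5 rows in their original order.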

Print the unique elements of rows to a separate .txt file