Question

I have a file like this:

x 48012  F 1.000
x 48169  R 0.361
x 87041  R 0.118
x 9032   R 0.176
x 9150   R 0.521

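I want to filter the rows so that the result file contains only unique rows, where two rows count as the same if columns 1, 2 and 3 match, with a tolerance of +/- 200 on column 2. So, for example, the first two lines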

x 48012  F 1.000
x 48169  R 0.361

would become

x 48012  F 1.000

because 48169 - 48012 is 157, which is within ±200.

Overall, the final file would be:

    x 48012  F 1.000
    x 87041  R 0.118
    x 9032   R 0.176

I tried:

out=open('result.txt', 'w')
my_file= open('test.txt', 'r')
seen = set()
for line in my_file:
        line=line.strip().split('\t')
        if line[0]==seen[0] and line[2]==seen[2] and ((int(line[1])==int(seen[1]-200)) or (int(line[1])==(seen[1]-200))):
            out.write(line)

but a set cannot be indexed.

Answer 1

Try this:

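with open('result.txt', 'w') as out:
    with open('file_36086075.txt', 'r') as my_file:
        row1 = None  # row currently being kept
        row2 = None  # next row, compared against row1
        for line in my_file:
            if not row1:
                row1 = line.strip().split('\t')
            elif not row2:
                row2 = line.strip().split('\t')
            if row1 and row2:
                diff = int(row1[1]) - int(row2[1])
                if row1[0] == row2[0] and row1[2] == row2[2] and -200 <= diff <= 200:
                    # row2 duplicates row1: write row1 only and start over
                    out.write('\t'.join(row1) + '\n')
                    row1 = None
                    row2 = None
                else:
                    # no match: write row1 and compare row2 with the next line
                    out.write('\t'.join(row1) + '\n')
                    row1 = row2
                    row2 = None
        # write the last row if it was never paired with a following row
        if row1:
            out.write('\t'.join(row1) + '\n')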
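
To try the adjacent-pair logic without touching any files, the same comparison can be sketched as a function over a list of lines. The name `filter_adjacent`, the `tolerance` parameter, and splitting on any whitespace instead of tabs are assumptions made for illustration; they are not part of the answer above.

```python
def filter_adjacent(lines, tolerance=200):
    """Keep the first row of each adjacent matching pair (hypothetical helper)."""
    kept = []
    previous = None  # row waiting to be compared with the next one
    for line in lines:
        row = line.split()  # assumption: whitespace-separated columns
        if not row:
            continue  # skip blank lines
        if previous is None:
            previous = row
            continue
        same_keys = previous[0] == row[0] and previous[2] == row[2]
        close = abs(int(previous[1]) - int(row[1])) <= tolerance
        kept.append(previous)
        # On a match the pair collapses to its first row; otherwise move on.
        previous = None if (same_keys and close) else row
    if previous is not None:
        kept.append(previous)  # the last row had no partner
    return kept


sample = [
    "x 48012  F 1.000",
    "x 48169  R 0.361",
    "x 87041  R 0.118",
    "x 9032   R 0.176",
    "x 9150   R 0.521",
]
for row in filter_adjacent(sample):
    print("\t".join(row))
```

On the sample data this keeps the rows for 48012, 48169, 87041 and 9032. The first two rows both survive because their third columns differ (F vs R); if such rows should merge, as in the question's expected output, the third-column check would have to be dropped.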

Answer 2

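Using a set here doesn't make sense, because you have to split each element into tokens, which makes it hard to manage. I would use a pair of two-dimensional arrays (lists of lists), one for the candidate rows and one for the results.

I would read the whole file into the candidate array and create an empty result array. Then I would iterate over the candidates and look for a match in the result array. If no match is found there, I copy the candidate into the result array.

Something like this: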

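candidates = []
results = []
# my_file is the input file from the question
with open('test.txt', 'r') as my_file:
    for line in my_file:
        candidates.append(line.strip().split('\t'))
for line in candidates:
    seen = False
    for possible_match in results:
        if matching_line(possible_match, line):
            seen = True
    # keep the candidate only if nothing in results matches it
    if not seen:
        results.append(line)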

Then you need a function that decides whether two rows match:

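def matching_line(array1, array2):
    # same first and third columns, second column within +/- 200
    return (array1[0] == array2[0]
            and array1[2] == array2[2]
            and abs(int(array1[1]) - int(array2[1])) <= 200)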
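
Putting the two pieces together, the approach described above might look like the following self-contained sketch. The names `rows_match` and `dedupe`, the `tolerance` and `compare_third` parameters, and splitting on any whitespace are assumptions made for illustration. The `compare_third` switch exists only because the question's rule mentions column 3 while its expected output merges an F row with an R row.

```python
def rows_match(a, b, tolerance=200, compare_third=True):
    """Return True if two split rows count as duplicates (hypothetical helper)."""
    if a[0] != b[0]:
        return False
    if compare_third and a[2] != b[2]:
        return False
    return abs(int(a[1]) - int(b[1])) <= tolerance


def dedupe(lines, **options):
    """Keep each row unless it matches a row that was already kept."""
    results = []
    for line in lines:
        row = line.split()  # assumption: whitespace-separated columns
        if not row:
            continue  # skip blank lines
        if not any(rows_match(kept, row, **options) for kept in results):
            results.append(row)
    return results


sample = [
    "x 48012  F 1.000",
    "x 48169  R 0.361",
    "x 87041  R 0.118",
    "x 9032   R 0.176",
    "x 9150   R 0.521",
]
for row in dedupe(sample, compare_third=False):
    print("\t".join(row))
```

With `compare_third=False` the sample reduces to the rows for 48012, 87041 and 9032, which is the output the question asks for. With the default `compare_third=True`, the 48169 row is kept as well, since its third column differs from that of 48012. Unlike an adjacent-pair comparison, this version checks every kept row, so it also catches duplicates that are not next to each other, at the cost of a quadratic number of comparisons.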

