Question

Edit: My question was answered on reddit. If anyone is interested in the answer, it can be found at the following link:

https://www.reddit.com/r/learnpython/comments/42ibhg/how_to_match_fields_from_two_lists_and_further/

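I am trying to take the pos and alt strings from file1 and match them against the contents of file2, which is fairly simple. However, file2 has values from the 17th split element/column through the last element/column (the 340th) that contain strings such as 1/1:1.2.2:51:12, and I want to filter on those as well.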

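I want to extract the lines from file2 that contain/match pos and alt. After that, I want to further filter the matched results down to those that contain specific values in the 17th split element/column onward. To do that, the values have to be split on ":", so that I can filter on split[0] == "1/1" and split[2] > 50. The problem is that I have no idea how to do this.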

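I think I will have to iterate over these values and split them, but I am not sure how, because the code is currently inside a loop and the values I want to filter on are columns rather than rows.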

Any suggestions would be greatly appreciated. I have been stuck on this problem since Friday and have not yet found a solution.

import os,itertools,re
file1 = open("file1.txt","r")
file2 = open("file2.txt","r")

matched = []

for (x),(y) in itertools.product(file2,file1):
    if not x.startswith("#"):
            cells_y = y.split("\t")
            pos_y = cells_y[0]
            alt_y = cells_y[3]

            cells_x = x.split("\t")
            pos_x = cells_x[0]+":"+cells_x[1]
            alt_x = cells_x[4]

            if pos_y in pos_x and alt_y in alt_x:
                    matched.append(x)

for z in matched:
    cells_z = z.split("\t")
    if cells_z[16:len(cells_z)]:
        pass  # stuck here: how do I split these columns on ":" and filter on them?

Answer 1

Your requirements are not entirely clear, but do you mean something like this:

for (x),(y) in itertools.product(file2,file1):
    if x.startswith("#"):
        continue

    cells_y = y.split("\t")
    pos_y = cells_y[0]
    alt_y = cells_y[3]

    cells_x = x.split("\t")
    pos_x = cells_x[0]+":"+cells_x[1]
    alt_x = cells_x[4]

    if pos_y != pos_x: continue
    if alt_y != alt_x: continue

    extra_match = False

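    # columns 17 through 340 of the file2 row are 0-based indices 16..339
    for field in cells_x[16:340]:
        x_extra = field.split(':')

        if x_extra[0] != '1/1': continue
        # the split pieces are strings, so convert before comparing with 50
        if int(x_extra[2]) <= 50: continue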
        extra_match = True
        break

    if not extra_match: continue

    xy = x + y
    matched.append(xy)

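I chose to append the concatenation of x and y to the matched list because I was not sure whether you wanted all of the data. If not, feel free to go back to appending just x or y.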
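The per-column check could also be pulled out into a small helper, which may make the loop easier to follow. This is only a sketch: the names `passes_filter` and `row_passes` and the default threshold are made up for illustration, and it assumes every field looks like the `1/1:1.2.2:51:12` example from the question, with the number to compare in the third ":"-separated position.

```python
def passes_filter(field, genotype="1/1", threshold=50):
    """Return True if a field such as '1/1:1.2.2:51:12' has the wanted
    first part and a third part greater than threshold.

    Hypothetical helper; the field layout is assumed from the question.
    """
    parts = field.strip().split(":")
    if len(parts) < 3:
        return False  # malformed or missing field
    if parts[0] != genotype:
        return False
    try:
        return int(parts[2]) > threshold
    except ValueError:
        return False  # third part is not an integer


def row_passes(cells, start=16):
    """True if any column from index `start` onward passes the filter."""
    return any(passes_filter(field) for field in cells[start:])
```

With this in place, the inner loop would reduce to something like `if not row_passes(cells_x): continue`. Whether a single qualifying column is enough, or every column has to qualify, is not clear from the question; replacing `any` with `all` covers the second reading.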

Answer 2

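You may want to look at the csv library, which can use a tab as the delimiter. You can also use generators and/or guards to make the code more Pythonic and efficient. I think your index-based approach works well enough, but it is easy to break when you modify the code down the road, and it needs updating if the shape of the file lines changes. You may want to create objects (I use namedtuples in the last part) to represent your lines and make the code easier to read and improve.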
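As a rough illustration of the csv suggestion, the reader below splits tab-separated text into lists of cells and uses a guard to drop blank and "#" lines. The sample data is invented, and `io.StringIO` stands in for a real file so the snippet runs on its own.

```python
import csv
import io

# Invented sample standing in for a tab-separated file.
sample = io.StringIO("#header line\nchr1\t100\tid1\tA\tG\nchr2\t200\tid2\tC\tT\n")

rows = [
    row
    for row in csv.reader(sample, delimiter="\t")
    if row and not row[0].startswith("#")  # guard: skip blank and '#' lines
]

for row in rows:
    # first two cells joined with ":" as the position, fifth cell as alt
    print(row[0] + ":" + row[1], row[4])
```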

Finally, remember that Python short-circuits the conditions it evaluates in an 'if'.

For example:

if x_evaluation and y_evaluation:
    do some stuff

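When x_evaluation evaluates to False, Python skips y_evaluation entirely. In your code, cells_x[0]+":"+cells_x[1] is computed on every iteration of the loop. Instead of storing this value, I wait until the simpler alt comparison evaluates to True and only then do the (relatively) heavier check.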
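A small demonstration of that short-circuit behaviour follows. The function names are made up for the example; the list only records which checks actually ran.

```python
calls = []

def cheap_check():
    calls.append("cheap")
    return False

def expensive_check():
    calls.append("expensive")
    return True

# `and` stops at the first falsy operand, so expensive_check() is never called
if cheap_check() and expensive_check():
    print("both passed")

print(calls)  # ['cheap']
```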

import csv

def filter_matching_alt_and_pos(first_file, second_file):
    # newline='' is what the csv module expects for text-mode files on Python 3
    with open(first_file, newline='') as first:
        for x in csv.reader(first, delimiter='\t'):
            # the second file is reopened and rescanned for every row of the first file
            with open(second_file, newline='') as second:
                for y in csv.reader(second, delimiter='\t'):
                    # guard: skip blank lines and '#' header lines
                    if not y or y[0].startswith('#'):
                        continue
                    # cheap alt comparison first; the join only runs if it passes
                    if x[3] == y[4] and x[0] == ":".join(y[:2]):
                        # yield the second-file row, since it holds the columns filtered on later
                        yield y

def match_datestamp_and_alt_and_pos(first_file, second_file):
    for z in filter_matching_alt_and_pos(first_file, second_file):
        for element in z[16:]:
            # I am not sure I fully understood your filter needs for the 2nd half. Here, every column from the 17th onward is split on ":" and checked for the two cases you mentioned.
            chunk = element.split(":")
            # guard: skip columns that do not have at least three ":"-separated parts
            if len(chunk) < 3:
                continue
            # same idea as before: do the lighter string check first and bail out early
            if chunk[0] != "1/1":
                continue
            # WARNING: int() raises ValueError if the 3rd part is not an integer
            if not int(chunk[2]) > 50:
                continue
            # one qualifying column is enough: yield the row once and stop scanning it
            yield z
            break


if __name__ == '__main__':
    first_file = "first.txt"
    second_file = "second.txt"
    # match_datestamp_and_alt_and_pos returns a generator; for loop through it for the lines which matched all 4 cases
    for row in match_datestamp_and_alt_and_pos(first_file=first_file, second_file=second_file):
        print("\t".join(row))

Named tuples for the first part

from collections import namedtuple
FirstFileElement = namedtuple("FirstFileElement", "pos unused1 unused2 alt")
SecondFileElement = namedtuple("SecondFileElement", "pos1 pos2 unused2 unused3 alt")

def filter_matching_alt_and_pos(first_file, second_file):
    with open(first_file, newline='') as first:
        for x in csv.reader(first, delimiter='\t'):
            with open(second_file, newline='') as second:
                for y in csv.reader(second, delimiter='\t'):
                    # guard: skip blank lines and '#' header lines
                    if not y or y[0].startswith('#'):
                        continue
                    # name only the leading columns; the rows may hold more cells than the tuples have fields
                    x_element = FirstFileElement(*x[:4])
                    y_element = SecondFileElement(*y[:5])
                    if x_element.alt == y_element.alt and x_element.pos == ":".join([y_element.pos1, y_element.pos2]):
                        yield y
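To show what the named fields buy you, here is a minimal standalone sketch. The row contents are invented, and the field names simply follow the layout assumed in this answer (position in the first two cells, alt in the fifth).

```python
from collections import namedtuple

# Hypothetical field names, matching the layout assumed in the answer above.
SecondFileElement = namedtuple("SecondFileElement", "pos1 pos2 unused2 unused3 alt")

row = ["chr1", "100", "id1", "A", "G", "extra", "columns"]
element = SecondFileElement(*row[:5])  # only the first five cells are named

position = ":".join([element.pos1, element.pos2])
print(position, element.alt)
```

Reading `element.alt` instead of `row[4]` means that a change in column order only has to be reflected in the namedtuple definition, not at every place an index is used.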

How to match fields from two lists and further filter based on values in subsequent fields?
