How do I match fields from two lists and further filter on values in subsequent fields?

Time: 2016-01-24 22:39:23

Tags: python list loops filter iterator

Edit: My question was answered on reddit. If anyone is interested in the answer, it can be found at the following link:

https://www.reddit.com/r/learnpython/comments/42ibhg/how_to_match_fields_from_two_lists_and_further/

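I am trying to take the pos and alt strings from file1 and match them against the contents of file2, which is fairly simple. However, file2 has values from the 17th split element/column through the last element/column (the 340th), containing strings such as 1/1:1.2.2:51:12, which I also want to filter on.

I want to extract the lines from file2 that contain/match pos and alt. After that, I want to further filter the matched results down to those with specific values in the 17th split element/column onward. To do that, the values have to be split on ":" so that I can filter for split[0] == "1/1" and split[2] > 50. The problem is that I don't know how to do this.

I think I will have to iterate over these and split them, but I don't know how, because the code is currently inside a loop and the values I want to filter are in columns rather than rows.

Any suggestions would be greatly appreciated; I have been stuck on this since Friday and have not yet found a solution.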

import os,itertools,re
file1 = open("file1.txt","r")
file2 = open("file2.txt","r")

matched = []

for (x),(y) in itertools.product(file2,file1):
    if not x.startswith("#"):
            cells_y = y.split("\t")
            pos_y = cells_y[0]
            alt_y = cells_y[3]

            cells_x = x.split("\t")
            pos_x = cells_x[0]+":"+cells_x[1]
            alt_x = cells_x[4]

            if pos_y in pos_x and alt_y in alt_x:
                    matched.append(x)

for z in matched:
    cells_z = z.split("\t")
    # unfinished: this is where I am stuck on filtering the 17th column onward
    if cells_z[16:]:
        pass

2 Answers:

Answer 0: (score: 0)

Your requirements are not entirely clear, but do you mean something like this:

for (x),(y) in itertools.product(file2,file1):
    if x.startswith("#"):
        continue

    # strip the newline so the last column compares cleanly
    cells_y = y.rstrip("\n").split("\t")
    pos_y = cells_y[0]
    alt_y = cells_y[3]

    cells_x = x.split("\t")
    pos_x = cells_x[0]+":"+cells_x[1]
    alt_x = cells_x[4]

    if pos_y != pos_x: continue
    if alt_y != alt_x: continue

    extra_match = False

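    # the sample columns are in the file2 line (x): 17th column (index 16) to the last
    for field in cells_x[16:]:
        x_extra = field.split(':')

        if x_extra[0] != '1/1': continue
        # the split yields strings, so convert before comparing with 50
        if int(x_extra[2]) <= 50: continue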
        extra_match = True
        break

    if not extra_match: continue

    xy = x + y
    matched.append(xy)

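I chose to concatenate x and y in the matched array because I wasn't sure whether you wanted all of the data. If not, feel free to go back to appending just x or y.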
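The per-column check can also be pulled out into a small helper. The following is a minimal sketch, assuming each sample field looks like 1/1:1.2.2:51:12 and that the value to compare sits in the third colon-separated position. The names field_passes and row_passes are made up for illustration and are not part of the answer above.

```python
def field_passes(field, genotype="1/1", threshold=50):
    """Return True if a colon-separated field has the wanted genotype
    and an integer third sub-field greater than threshold."""
    parts = field.strip().split(":")
    if len(parts) < 3 or parts[0] != genotype:
        return False
    try:
        return int(parts[2]) > threshold
    except ValueError:
        # third sub-field is not an integer (e.g. "."), so treat it as no match
        return False


def row_passes(line, first_sample_index=16):
    """Return True if any column from the 17th onward passes field_passes."""
    cells = line.rstrip("\n").split("\t")
    return any(field_passes(cell) for cell in cells[first_sample_index:])


# hypothetical row: 16 placeholder columns followed by two sample columns
example = "\t".join(["x"] * 16 + ["0/1:1.2.2:60:12", "1/1:1.2.2:51:12"])
print(row_passes(example))
```

Whether a row should pass when any column qualifies or only when all of them do is not stated in the question; swapping any() for all() would give the stricter reading.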

Answer 1: (score: 0)

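You may want to look at the csv library, which can use tab as a delimiter. You can also use generators and/or guards to make the code more pythonic and efficient. I think your index-based approach works fine, but it can easily break when you try to modify it down the road, or need updating if the shape of the file lines changes. You may want to create objects (I use NamedTuples in the last part) to represent your lines and make the code easier to read and improve.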
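As a minimal sketch of the csv suggestion, the snippet below reads tab-separated text with csv.reader. The two sample rows and their column layout are invented for illustration; an in-memory io.StringIO stands in for a real file.

```python
import csv
import io

# hypothetical two-row, tab-separated sample standing in for a real file
sample = io.StringIO("chr1\t100\t.\tA\tG\nchr2\t200\t.\tC\tT\n")

rows = list(csv.reader(sample, delimiter="\t"))
for row in rows:
    # each row is already a list of column strings, with the newline removed
    print(row[0] + ":" + row[1], row[4])
```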

Finally, keep in mind that Python short-circuits the comparisons in an 'if':

For example:

if x_evaluation and y_evaluation:
    do some stuff

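When x_evaluation returns False, Python skips y_evaluation entirely. In your code, cells_x[0]+":"+cells_x[1] is computed on every iteration of the loop. Instead of storing this value, I wait until the simpler alt comparison evaluates to True before doing the (relatively) heavier, costlier check.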
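The short-circuit behaviour described here can be checked with a small sketch. The functions cheap_check and expensive_check are hypothetical stand-ins that only record when they are called.

```python
calls = []

def cheap_check(value):
    calls.append("cheap")
    return value

def expensive_check(value):
    calls.append("expensive")
    return value

# the left operand is False, so `and` never calls expensive_check
if cheap_check(False) and expensive_check(True):
    pass
first_run = list(calls)

# the left operand is True, so the right operand is evaluated as well
calls.clear()
if cheap_check(True) and expensive_check(True):
    pass
second_run = list(calls)

print(first_run, second_run)
```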

import csv

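def filter_matching_alt_and_pos(first_file, second_file):
    # Python 3: csv wants text mode with newline='' (the original 'rb' is Python 2 only)
    with open(first_file, newline='') as f1:
        for x in csv.reader(f1, delimiter='\t'):
            with open(second_file, newline='') as f2:
                for y in csv.reader(f2, delimiter='\t'):
                    # skip blank lines and "#" header lines in the second file
                    if not y or y[0].startswith('#'):
                        continue
                    # the lighter alt check comes first; `and` short-circuits,
                    # so the join only runs once alt already matches
                    # .. todo:: we could make a filter function and even use the filter() built-in depending on needs!
                    if x[3] == y[4] and x[0] == ":".join(y[:2]):
                        # yield the second-file row, which holds the columns filtered later
                        yield y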

def match_datestamp_and_alt_and_pos(first_file, second_file):
    for z in filter_matching_alt_and_pos(first_file, second_file):
        for element in z[16:]:
            # I am not sure I fully understood your filter needs for the 2nd half. Here, I split each element from the 17th onward and look for the two cases you mentioned. This seems like it might be very heavy, but at least we're using generators!
            chunk = element.split(":")
            # guard style: continue as soon as one condition fails
            # the cheap string comparison runs before the int() conversion
            if chunk[0] != "1/1":
                continue
            # WARNING: if you aren't 100% sure the 3rd sub-field is an int, int() will raise ValueError
            if len(chunk) < 3 or not int(chunk[2]) > 50:
                continue
            yield z
            # one qualifying column is enough, so stop and yield z only once
            break


if __name__ == '__main__':
    first_file = "first.txt"
    second_file = "second.txt"
    # match_datestamp_and_alt_and_pos returns a generator; loop through it for the lines which matched all 4 cases
    for row in match_datestamp_and_alt_and_pos(first_file=first_file, second_file=second_file):
        print("\t".join(row))

Named tuples for the first part

import csv
from collections import namedtuple
FirstFileElement = namedtuple("FirstFileElement", "pos unused1 unused2 alt")
SecondFileElement = namedtuple("SecondFileElement", "pos1 pos2 unused2 unused3 alt")

def filter_matching_alt_and_pos(first_file, second_file):
    # Python 3: csv wants text mode with newline='' (the original 'rb' is Python 2 only)
    with open(first_file, newline='') as f1:
        for x in csv.reader(f1, delimiter='\t'):
            # name only the leading columns; a row may carry many more
            x_element = FirstFileElement(*x[:4])
            with open(second_file, newline='') as f2:
                for y in csv.reader(f2, delimiter='\t'):
                    # skip blank lines and "#" header lines in the second file
                    if not y or y[0].startswith('#'):
                        continue
                    y_element = SecondFileElement(*y[:5])
                    # .. todo:: we could make a filter function and even use the filter() built-in depending on needs!
                    if x_element.alt == y_element.alt and x_element.pos == ":".join([y_element.pos1, y_element.pos2]):
                        # yield the second-file row, which holds the columns filtered later
                        yield y
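One detail worth checking when using named tuples on wide rows: a namedtuple accepts exactly as many values as it has fields, so a row with extra columns has to be sliced before unpacking. The sketch below assumes a made-up row with two trailing sample columns.

```python
from collections import namedtuple

SecondFileElement = namedtuple("SecondFileElement", "pos1 pos2 unused2 unused3 alt")

# hypothetical row with extra trailing sample columns
row = ["chr1", "100", ".", "A", "G", "0/1:1.2.2:60:12", "1/1:1.2.2:51:12"]

# unpacking the whole row passes too many arguments and raises TypeError
try:
    SecondFileElement(*row)
    unpack_failed = False
except TypeError:
    unpack_failed = True

# slice to the five named columns before unpacking
element = SecondFileElement(*row[:5])
position = ":".join([element.pos1, element.pos2])
print(unpack_failed, position, element.alt)
```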