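Edit: My question was answered on reddit. If anyone is interested in the answer, here is the link: https://www.reddit.com/r/learnpython/comments/42ibhg/how_to_match_fields_from_two_lists_and_further/
I am trying to take the pos and alt strings from file1 and match them against the contents of file2, which is fairly simple. However, file2 also has values from the 17th split element/column through the last element/column (the 340th), containing strings such as 1/1:1.2.2:51:12, which I also want to filter on.
I want to extract the lines from file2 that match pos and alt. After that, I want to filter the matched lines further, keeping only those with certain values in the 17th split element/column onward. To do that, the values have to be split on ":" so that I can filter on split[0] == "1/1" and split[2] > 50. The problem is that I don't know how to do this.
I imagine I will have to iterate over these columns and split each one, but I am not sure how, because the code is currently inside a loop and the values I want to filter are columns, not rows.
Any suggestions would be greatly appreciated; I have been stuck on this since Friday and have not found a solution yet.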
import os,itertools,re
file1 = open("file1.txt","r")
file2 = open("file2.txt","r")
matched = []
for (x),(y) in itertools.product(file2,file1):
    if not x.startswith("#"):
        cells_y = y.split("\t")
        pos_y = cells_y[0]
        alt_y = cells_y[3]
        cells_x = x.split("\t")
        pos_x = cells_x[0]+":"+cells_x[1]
        alt_x = cells_x[4]
        if pos_y in pos_x and alt_y in alt_x:
            matched.append(x)
for z in matched:
    cells_z = z.split("\t")
    if cells_z[16:len(cells_z)]:
        pass  # stuck here: these columns still need to be split on ":" and filtered
Answer 0 (score: 0)
Your requirements are not entirely clear, but do you mean something like this:
for (x),(y) in itertools.product(file2,file1):
    if x.startswith("#"):
        continue
    # strip the newline so the last column compares cleanly
    cells_y = y.rstrip("\n").split("\t")
    pos_y = cells_y[0]
    alt_y = cells_y[3]
    cells_x = x.rstrip("\n").split("\t")
    pos_x = cells_x[0]+":"+cells_x[1]
    alt_x = cells_x[4]
    if pos_y != pos_x: continue
    if alt_y != alt_x: continue
    extra_match = False
    # the 17th column onward of the file2 line starts at index 16
    for f in range(16, len(cells_x)):
        x_extra = cells_x[f].split(':')
        if x_extra[0] != '1/1': continue
        # the third sub-field is text, so convert it before comparing
        if int(x_extra[2]) <= 50: continue
        extra_match = True
        break
    if not extra_match: continue
    xy = x + y
    matched.append(xy)
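I chose to concatenate x and y in the matched array because I wasn't sure whether you wanted all of the data. If not, feel free to go back to appending just x or y.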
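The per-column test inside the loop above could also be pulled out into a small helper, which makes the two conditions easier to check on their own. This is only a sketch: the names `sample_passes` and `row_passes` and their parameters are made up for illustration, and it assumes every column looks like the 1/1:1.2.2:51:12 example from the question, with the number to compare in the third ":"-separated position.

```python
def sample_passes(field, wanted="1/1", threshold=50):
    """Return True if one ':'-separated column meets both conditions.

    Hypothetical helper, not part of the original code.
    """
    parts = field.strip().split(":")
    if len(parts) < 3:
        # malformed column: not enough sub-fields to test
        return False
    if parts[0] != wanted:
        return False
    try:
        value = int(parts[2])
    except ValueError:
        # third sub-field is not an integer, so treat it as a non-match
        return False
    return value > threshold


def row_passes(cells, first_index=16):
    """Return True if any column from the 17th onward passes."""
    return any(sample_passes(cell) for cell in cells[first_index:])
```

With this in place, the inner loop could be replaced by something like `if not row_passes(cells_x): continue`.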
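Answer 1 (score: 0)
You might want to look at the csv library, which can use tab as a delimiter. You can also use generators and/or guards to make the code more pythonic and efficient. I think your approach with indexes works quite well, but it would be easy to break when you try to modify it down the road, or hard to update if the lines in your files change shape. You may wish to create objects (I use NamedTuples in the last part) to represent your lines and make the code much easier to read and improve.
Lastly, remember that Python short-circuits comparisons in 'if' statements.
For example:
if x_evaluation and y_evaluation:
    pass  # do some stuff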
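When x_evaluation returns False, Python skips y_evaluation entirely. In your code, cells_x[0]+":"+cells_x[1] is evaluated on every iteration of the loop. Rather than storing this value, I wait until the simpler alt comparison evaluates to True before doing the (relatively) heavier/uglier check.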
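To make the short-circuit behaviour concrete, here is a tiny self-contained sketch. The function `evaluation_order` and its inner helpers are hypothetical names used only for this demonstration; it records which side of the `and` actually ran.

```python
def evaluation_order(left_result, right_result):
    """Record which operands of `left() and right()` get evaluated."""
    calls = []

    def left():
        calls.append("left")
        return left_result

    def right():
        calls.append("right")
        return right_result

    # `and` only evaluates right() when left() returned a truthy value
    if left() and right():
        calls.append("body")
    return calls
```

Calling `evaluation_order(False, True)` returns `['left']`: the right-hand side never runs, which is why putting the cheaper test first can save work.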
import csv
def filter_matching_alt_and_pos(first_file, second_file):
    # in Python 3 the csv module wants text mode with newline='' (the old 'rb' mode only works on Python 2)
    for x in csv.reader(open(first_file, newline=''), delimiter='\t'):
        for y in csv.reader(open(second_file, newline=''), delimiter='\t'):
            # continue will skip the rest of this loop and go to the next value for y
            # this way, we can abort as soon as one value isn't what we want
            # .. todo:: we could make a filter function and even use the filter() built-in depending on needs!
            if len(y) < 5 or y[0].startswith("#"):
                continue
            # lighter alt check first; the join only runs when alt already matches
            if x[3] == y[4] and x[0] == ":".join(y[:2]):
                # yield the second-file row, since that is the one holding the 17th column onward
                yield y
def match_datestamp_and_alt_and_pos(first_file, second_file):
    for z in filter_matching_alt_and_pos(first_file, second_file):
        for element in z[16:]:
            # I am not sure I fully understood your filter needs for the 2nd half. Here, I split each element from the 17th onward and look for the two cases you mentioned. This seems like it might be very heavy, but at least we're using generators!
            # same idea as before, we abort as early as possible to avoid needless indexing and checks
            chunks = element.split(":")
            # here, I use the continue keyword and the negative-check to help eliminate excess overhead. The execution is very similar as above, but might be easier to read/understand and can help speed things along in some cases
            # once again, I do the lighter check before the heavier one
            if not chunks[0] == "1/1":
                # continue automatically skips to the next iteration on element
                continue
            # WARNING: if you aren't 100% sure the 3rd chunk is an int, this is very dangerous
            if not int(chunks[2]) > 50:
                continue
            yield z
            # one matching column is enough, so move on to the next row
            break
if __name__ == '__main__':
    first_file = "first.txt"
    second_file = "second.txt"
    # match_datestamp_and_alt_and_pos returns a generator; for loop through it for the lines which matched all 4 cases
    for row in match_datestamp_and_alt_and_pos(first_file=first_file, second_file=second_file):
        print("\t".join(row))
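The nested loops above re-read the second file once for every row of the first file. If exact equality on pos and alt is really what is wanted (an assumption; the question's original code used substring `in` tests), a different technique is to collect the (pos, alt) keys from the first file into a set and then pass over the second file only once. The sketch below works on already-split rows; the function name `match_rows` is hypothetical and the column positions are taken from the question.

```python
def match_rows(first_rows, second_rows):
    """Yield rows of second_rows whose pos/alt pair occurs in first_rows.

    first_rows:  iterable of lists, pos in column 0 and alt in column 3
    second_rows: iterable of lists, pos split over columns 0 and 1,
                 alt in column 4
    """
    # Build the lookup set once from the first file's rows.
    wanted = set()
    for x in first_rows:
        if len(x) >= 4:
            wanted.add((x[0], x[3]))

    # Single pass over the second file's rows; set membership is cheap.
    for y in second_rows:
        if len(y) < 5 or y[0].startswith("#"):
            continue
        if (y[0] + ":" + y[1], y[4]) in wanted:
            yield y
```

The rows could come from `csv.reader(f, delimiter='\t')`, and the generator's output could then be fed into the column filter in place of `filter_matching_alt_and_pos`.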
Named tuples for the first part
import csv
from collections import namedtuple
FirstFileElement = namedtuple("FirstFileElement", "pos unused1 unused2 alt")
SecondFileElement = namedtuple("SecondFileElement", "pos1 pos2 unused2 unused3 alt")
def filter_matching_alt_and_pos(first_file, second_file):
    for x in csv.reader(open(first_file, newline=''), delimiter='\t'):
        for y in csv.reader(open(second_file, newline=''), delimiter='\t'):
            # continue will skip the rest of this loop and go to the next value for y
            # this way, we can abort as soon as one value isn't what we want
            # .. todo:: we could make a filter function and even use the filter() built-in depending on needs!
            if len(y) < 5 or y[0].startswith("#"):
                continue
            # only the leading columns are named, so slice each row before unpacking
            x_element = FirstFileElement(*x[:4])
            y_element = SecondFileElement(*y[:5])
            if x_element.alt == y_element.alt and x_element.pos == ":".join([y_element.pos1, y_element.pos2]):
                # yield the full second-file row so the later column filter can use it
                yield y