Question

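I have two data files made up of 4-line records. I need to extract from the second file every 4-line record whose first line partially matches the first line of a record in the first file.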

Here is an example of the input data:

input1.txt
@abcde:134/1
JDOIJDEJAKJ
content1
content2

input2.txt
@abcde:134/2
JKDJFLJSIEF
content3
content4
@abcde:135/2
KFJKDJFKLDJ
content5
content6

The output should look like this:

output.txt
@abcde:134/2
JKDJFLJSIEF
content3
content4

This is the code I tried to write:

import sys

filename1 = sys.argv[1] #input1.txt
filename2 = sys.argv[2] #input2.txt

F = open(filename1, 'r')
R = open(filename2, 'r')

def output(input1, input2):
    for line in input1:
        if "@" in line:
            for line2 in input2:
                if line[:-1] in line2:
                    for i in range(4):
                        print next(input2)

output = output(F, R)
write(output)

I get an invalid syntax error at next() and I can't figure out why. I'd be glad if someone could correct my code or give me a hint on how to make it work.

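=== EDIT === OK, I think I've managed to implement the solution suggested in the comments below (thanks). I am now running the code in a terminal session connected over ssh to a remote Ubuntu server. Here is what the code looks like now. (This time I'm running python2.7.)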

filename1 = sys.argv[1] #input file 1
filename2 = sys.argv[2] #input file 2 (some lines of which will be in the output)

F = open(filename1, 'r')
R = open(filename2, 'r')

def output(input1, input2):
    for line in input1:
        input2.seek(0)
        if "@" in line:
            for line2 in input2:
                if line[:-2] in line2:
                    for i in range(4):
                        out = next(input2)
                        print out
                        return

output (F, R)

Then I run this command:

python fetch_reverse.py test1.fq test.fq > test2.fq

I don't get any warnings, but the output file is empty. What am I doing wrong?

Answer 1

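Separate reading the first file from reading the second one: first gather all the lines you want to match (unless you are reading hundreds of thousands of lines to match). Store those lines, minus the digit at the end, in a set for fast lookup.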

Then scan the other file for matching lines:

def output(input1, input2):
    with input1:  # automatically close when done
        # set comprehension of all lines starting with @, minus last character
        to_match = {line.strip()[:-1] for line in input1 if line[0] == '@'}

    with input2:
        for line in input2:
            if line[0] == '@' and line.strip()[:-1] in to_match:
                print line.strip()
                for i in range(3):
                    print next(input2, '').strip()

You need to print the matched line as well, and then read the next three lines (line 1 of the record has already been read).
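
For anyone on Python 3, where `print` is a function, the same idea can be sketched roughly as below. This is only an illustration of the approach above, not a drop-in replacement: the function name `matching_records` and the `record_size` parameter are made up for this example, the inputs are in-memory stand-ins for the two files, and it assumes that only the first line of each record starts with `@`.

```python
import io


def matching_records(lines1, lines2, record_size=4):
    """Return the records in lines2 whose header matches a header in lines1.

    Hypothetical helper for illustration; both arguments are any iterable
    of lines, such as an open file object.
    """
    # Keys are the '@' header lines with their final character removed,
    # so '@abcde:134/1' and '@abcde:134/2' share the key '@abcde:134/'.
    keys = {line.strip()[:-1] for line in lines1 if line.startswith('@')}

    matched = []
    lines2 = iter(lines2)  # one shared iterator so next() continues the loop
    for line in lines2:
        if line.startswith('@') and line.strip()[:-1] in keys:
            record = [line.strip()]
            # The header is already read; pull the remaining lines.
            for _ in range(record_size - 1):
                record.append(next(lines2, '').strip())
            matched.append(record)
    return matched


# In-memory stand-ins for input1.txt and input2.txt from the question.
input1 = io.StringIO("@abcde:134/1\nJDOIJDEJAKJ\ncontent1\ncontent2\n")
input2 = io.StringIO(
    "@abcde:134/2\nJKDJFLJSIEF\ncontent3\ncontent4\n"
    "@abcde:135/2\nKFJKDJFKLDJ\ncontent5\ncontent6\n"
)

for record in matching_records(input1, input2):
    print("\n".join(record))
```

With real files you would pass the open file objects instead of the `io.StringIO` stand-ins. Returning the records rather than printing inside the function keeps the matching logic separate from where the output goes.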
