Question

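I am using Hadoop streaming with a Python subprocess to run a C++ executable (a bioinformatics program called BLAST). When run from the command line, BLAST writes its results to an output file. But when it runs on Hadoop, I cannot find BLAST's output files. Where do the output files go?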

My code (map.py) is as follows:

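import os
import sys
from os.path import join
from subprocess import Popen, PIPE

# path used on hadoop
tool = './blastx'
reference_path = 'Reference.fa'

# assumption: current_path is not defined in the original post;
# the task's working directory is used here
current_path = os.getcwd()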

# input format example

# >LW1           (contig name)
# ATCGATCGATCG   (sequence)

# sample file: https://goo.gl/XTauAx

(name, seq) = (None, None)

for line in sys.stdin:

    # when the ">" sign is detected, assign the contig name
    if line[0] == '>':
        name = line.strip()[1:]

    # otherwise, assign the sequence
    else:
        seq = line.strip()

        if name and seq:

            # assign the path of output file
            output_file = join(current_path, 'tmp_output', name)

            # blast command example (export out file to a given path)
            command = 'echo -e \">%s\\n%s\" | %s -db %s -out %s -evalue 1e-10 -num_threads 16' % (name, seq, tool, reference_path, output_file)

            # execute command with python subprocess
            cmd = Popen(command, stdin=PIPE, stdout=PIPE, stderr=PIPE, shell=True)

            # retrieve the standard output and standard error of the command
            cmd_out, cmd_err = cmd.communicate()

            print('%s\t%s' % (name, output_file))

The command that calls BLAST is:

command = 'echo -e \">%s\\n%s\" | %s -db %s -out %s -evalue 1e-10 -num_threads 16' % (name, seq, tool, reference_path, output_file)

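Normally the output files would be at the path given by output_file, but I cannot find them on either the local file system or HDFS. They seem to be created in a temporary directory and to disappear after execution. How can I retrieve them?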

Answer 1

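I found BLAST's output files. They appear to stay on the node where BLAST was executed. After I put them back into HDFS, I could access them under the directory /user/yarn. What I did was add the following code to map.py: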

command = 'hadoop fs -put %s' % output_file
cmd = Popen(command, stdin=PIPE, stdout=PIPE, shell=True)
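
Note that the Popen call above starts the upload but does not wait for it, so a failed put can go unnoticed. A minimal sketch of a blocking variant follows, assuming the `hadoop` client is on the task's PATH. The helper names `build_put_command` and `put_to_hdfs` are hypothetical and not part of the original answer:

```python
import subprocess
import sys


def build_put_command(local_path, hdfs_dir=None):
    """Return the argument list for 'hadoop fs -put' (hypothetical helper).

    With no hdfs_dir this mirrors the one-argument command in the answer.
    """
    command = ['hadoop', 'fs', '-put', local_path]
    if hdfs_dir is not None:
        command.append(hdfs_dir)
    return command


def put_to_hdfs(local_path, hdfs_dir=None):
    """Run the upload, wait for it to finish, and report failures on stderr."""
    proc = subprocess.Popen(build_put_command(local_path, hdfs_dir),
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if proc.returncode != 0:
        # stdout is reserved for the mapper's key/value output,
        # so diagnostics go to stderr instead
        sys.stderr.write('hadoop fs -put failed for %s: %r\n' % (local_path, err))
    return proc.returncode == 0
```

Passing an argument list instead of a shell string also avoids quoting problems if a contig name contains spaces or shell metacharacters.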

I also changed the output path to

output_file = name

instead of using

output_file = join(current_path, 'tmp_output', name)
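
A bare file name such as the contig name is resolved against the working directory of the task, which lives on the worker node that runs the mapper. To see where a file actually lands, a small sketch like the one below can write the host name and absolute path to stderr, which should then show up in the task logs rather than in the mapper's output. The helper name `describe_output` is hypothetical:

```python
import os
import socket
import sys


def describe_output(output_file):
    """Return 'host:absolute_path' for a path relative to the task's cwd."""
    return '%s:%s' % (socket.gethostname(), os.path.abspath(output_file))


if __name__ == '__main__':
    # 'LW1' stands in for a contig name used as output_file
    sys.stderr.write('BLAST output at %s\n' % describe_output('LW1'))
```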

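[Update 3/3] However, putting the files under the yarn user's directory is not a good idea, because ordinary users do not have permission to edit files there. I suggest putting the files into /tmp/blast_tmp instead, by changing the command to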

command = 'hadoop fs -put %s /tmp/blast_tmp' % output_file

Before that, create the directory /tmp/blast_tmp with

% hadoop fs -mkdir /tmp/blast_tmp

and change its permissions with

% hadoop fs -chmod 777 /tmp/blast_tmp

This way, both the yarn user and you can access the directory.
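
The one-time setup and the per-file upload can be sketched together in Python. This is only an illustration under the assumption that the `hadoop` client is available wherever the code runs. The names `HDFS_DIR`, `setup_commands`, `upload_command` and `run_all` are hypothetical, while the commands themselves are the ones given above:

```python
import subprocess

HDFS_DIR = '/tmp/blast_tmp'  # directory named in the answer


def setup_commands(hdfs_dir=HDFS_DIR):
    """Commands to run once, before the job, mirroring the answer's steps."""
    return [
        ['hadoop', 'fs', '-mkdir', hdfs_dir],
        ['hadoop', 'fs', '-chmod', '777', hdfs_dir],
    ]


def upload_command(output_file, hdfs_dir=HDFS_DIR):
    """Command the mapper runs for each BLAST output file."""
    return ['hadoop', 'fs', '-put', output_file, hdfs_dir]


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for command in commands:
        if subprocess.call(command) != 0:
            return False
    return True
```

Here `run_all(setup_commands())` would be run once before submitting the job, and `run_all([upload_command(name)])` inside the mapper after BLAST finishes. Mode 777 makes the directory writable by every user, so a narrower mode may be preferable on a shared cluster.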

Where are the output files of a Python subprocess in Hadoop streaming?
