Using GNU parallel to limit forked samtools processes in for and while loops, including awk

Date: 2018-04-05 08:42:06

Tags: bash loops fork gnu-parallel samtools

I am trying to limit the parallelization of a script. The purpose of the script is to take a list from each of 10 samples/folders and use the records of that list to run a samtools command, which is the most demanding part.

Here is the simplified version:

for (10 items)
do
  while read (list 5000 items)
  do
    command 1
    command 2
    command 3
    ...
    samtools view -L input1 input2 |many_pipes_including_'awk' > output_file &
    ### TODO (WARNING): currently all processes are forked at the same time. this needs to be resolved. limit to a certain number of processes.
  done
done

To make use of our local server, the script contains a forking command (the trailing &), which works. But it keeps forking until all of the server's resources are used up and nobody else can work on it.
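A plain-bash way to cap the forks would be to block before each fork until a slot is free; a minimal sketch of what I mean (assuming bash >= 4.3 for wait -n, with the same placeholders as above):

while read ITEM
do
    # block while 50 background jobs are already running
    while [ "$(jobs -rp | wc -l)" -ge 50 ]
    do
        wait -n    # returns as soon as any one background job finishes
    done
    samtools view -L input1 input2 | many_pipes_including_'awk' > output_file &
done < list_with_products.txt
wait    # wait for the remaining jobs before moving on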

That is why I would like to achieve something like parallel -j 50 with GNU parallel instead. I tried putting it in front of the samtools command that is being forked, like this:

parallel -j 50 -k samtools view -L input1 input2 |many_pipes_including_'awk' > output_file &

which did not work (I also tried it with backticks), and I got

[main_samview] region "item_from_list" specifies an unknown reference name. Continue anyway.

or somehow vim was invoked. But I am also not sure whether this is the right position for the parallel command in the script. Do you have any idea how to solve this so that the number of forked processes is limited?
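My assumption is that the pipe is part of the problem: unquoted, the shell splits the line at | before parallel ever sees it, so only the samtools part would run under parallel and the rest becomes a separate pipeline. Presumably the whole pipeline has to be passed to parallel as one quoted command, something like this (hypothetical placeholders again):

cat list_with_products.txt | parallel -j 50 -k "samtools view -L bed_for_{} input2 | many_pipes_including_awk > output_{}"

Here {} would be replaced with one line of the list per job, and since parallel itself limits the number of jobs, the trailing & should no longer be needed.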

I also considered implementing something like the FIFO-based semaphore mentioned in https://unix.stackexchange.com/questions/103920/parallelize-a-bash-for-loop/103921, but I was hoping GNU parallel could do what I am looking for? I looked at more pages, e.g. https://zvfak.blogspot.de/2012/02/samtools-in-parallel.html and https://davetang.org/muse/2013/11/18/using-gnu-parallel/, but they usually do not address this combination of problems.
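GNU parallel does ship a semaphore mode (sem, an alias for parallel --semaphore) that looks very close to the FIFO idea; a minimal sketch of how I imagine it would fit into the existing loop (placeholders as above):

while read ITEM
do
    # sem queues the job and blocks once 50 jobs are already running
    sem -j 50 "samtools view -L input1 input2 | many_pipes_including_awk > output_file"
done < list_with_products.txt
sem --wait    # block until all queued jobs have finished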

Here is a more detailed version of the script, in case any of the commands in it might be relevant (I have heard that awk, backticks and newlines can often be problematic?):

cd path_to_data
for SAMPLE_FOLDER in *
do
    cd ${SAMPLE_FOLDER}/another_folder
    echo "$SAMPLE_FOLDER was found"

    cat list_with_products.txt | while read PRODUCT_NAME_NO_SPACES
    do
        PRODUCT_NAME=`echo ${PRODUCT_NAME_NO_SPACES} | tr "@" " "`
        echo "$PRODUCT_NAME with white spaces"
        BED_FILENAME=${BED_DIR}/intersect_${PRODUCT_NAME_NO_SPACES}_${SAMPLE_FOLDER}.bed
        grep "$PRODUCT_NAME" file_to_search_through > ${TMP_DIR}/tmp.gff

        cat ${TMP_DIR}/tmp.gff | some 'awk' command > ${BED_FILENAME}

        samtools view -L ${BED_FILENAME} another_input_file.bam | many | pipes | with | 'awk' | and | perl | etc > resultfolder/resultfile &
        ### TODO (WARNING): currently all processes are forked at the same time. this needs to be resolved. limit to a certain number of processes.
        rm ${TMP_DIR}/tmp.gff
    done
    cd (back_to_start)
done
rmdir -p ${OUTPUT_DIR}/tmp

1 Answer:

Answer 0 (score: 1)

First make a function that takes a single sample + a single product as input:
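A minimal sketch of such a function, assuming it wraps the per-product body of the script in the question (names taken from there; the third argument is GNU parallel's job number {#}, useful for keeping temporary files apart):

doit() {
    SAMPLE_FOLDER="$1"
    PRODUCT_NAME_NO_SPACES="$2"
    SEQ="$3"    # job number, keeps per-job temporary files from colliding
    PRODUCT_NAME=`echo ${PRODUCT_NAME_NO_SPACES} | tr "@" " "`
    BED_FILENAME=${BED_DIR}/intersect_${PRODUCT_NAME_NO_SPACES}_${SAMPLE_FOLDER}.bed
    grep "$PRODUCT_NAME" ${SAMPLE_FOLDER}/another_folder/file_to_search_through > ${TMP_DIR}/tmp_${SEQ}.gff
    cat ${TMP_DIR}/tmp_${SEQ}.gff | some 'awk' command > ${BED_FILENAME}
    samtools view -L ${BED_FILENAME} another_input_file.bam | many | pipes > resultfolder/resultfile_${SEQ}
    rm ${TMP_DIR}/tmp_${SEQ}.gff
}
export -f doit    # required so the shells started by parallel can see the function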

If list_with_products.txt is the same for every sample:

parallel --results outputdir/ doit {1} {2} {#} ::: * :::: path/to/list_with_products.txt

If list_with_products.txt differs for each sample:

# Generate a list of:
# sample \t product
parallel --tag cd {}\;cat list_with_products.txt ::: * |
  # call doit on each sample,product. Put output in outputdir
  parallel --results outputdir/ --colsep '\t' doit {1} {2} {#}
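Note that --results outputdir/ stores each job's stdout and stderr in files under outputdir/, and that parallel defaults to one job per CPU core, so add -j 50 to either command to match the limit asked for in the question.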
