How to run grep in parallel on individual lines from a list

Asked: 2016-08-24 08:58:46

Tags: bash grep gnu-parallel

I'm a beginner with bash, and I need some help making my work more efficient.

while IFS= read -r line
do
    echo "$line"
    file="Species.$line"
    grep -A 1 "$line" /project/ag-grossart/ionescu/DB/rRNADB/SILVA_123.1_SSURef_one_line.fasta > "$file"
done < species1

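The species file (species1) contains about 100,000 species names. The file I'm searching is a 24 GB FASTA (text) file.

The large file has the following format: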

Domain;Phylum;Class;Order;Family;Genus;Species

AGCT ---- AGCT (50,000 characters per line)

Here is a sample of the species file (there are no blank lines in between):

Alkanindiges_illinoisensis
Alkanindiges_sp._JJ005
Alligator_sinensis
Allisonella_histaminiformans
'Allium_cepa'
Alloactinosynnema_album
Alloactinosynnema_sp._Chem10
Alloactinosynnema_sp._CNBC1
Alloactinosynnema_sp._CNBC2
Alloactinosynnema_sp._FMA
Alloactinosynnema_sp._MN08-A0205
Allobacillus_halotolerans
Allochromatium_truperi
Allochromatium_vinosum

This is the first line of the large file:

HP451749.6.1794_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota;Pucciniomycotina;Pucciniomycetes;Pucciniales;Pucciniaceae;Puccinia;Puccinia_triticina.............................................................................-UC-U-G--G-U---------------------------
(this goes on for 50,000 characters per line)

Here are some more headers:

>EF164983.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_innocens
>X96499.1.1810_Eukaryota;Archaeplastida;Chloroplastida;Charophyta;Phragmoplastophyta;Streptophyta;Embryophyta;Marchantiophyta;Jungermanniales;Calypogeia;Plagiochila_adiantoides
>AB034906.1.1763_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota;Saccharomycotina;Saccharomycetes;Saccharomycetales;Saccharomycetaceae;Citeromyces;Citeromyces_siamensis
>AY290717.1.1208_Archaea;Euryarchaeota;Methanomicrobia;Methanosarcinales;Methanosarcinaceae;Methanohalophilus;Methanohalophilus_portucalensis_FDF-1
>EF164984.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_pulli
>AY291120.1.1477_Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Lampropedia;Lampropedia_hyalina
>EF164987.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_alvinipulli
>JQ838073.1.1461_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS01
>EF164989.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_alvinipulli
>JQ838076.1.1460_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS04
>AB035584.1.1789_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Tremellomycetes;Tremellales;Trichosporonaceae;Trichosporon;Trichosporon_debeurmannianum
>JQ838080.1.1457_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS11
>EF165015.1.1527_Bacteria;Firmicutes;Clostridia;Clostridiales;Family_XI;Tepidimicrobium;Clostridium_sp._PML3-1
>U85867.1.1424_Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Alteromonadaceae;Marinobacter;Marinobacter_sp.
>EF165044.1.1398_Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Methylobacteriaceae;Methylobacterium;Methylobacterium_sp._CBMB38
>U85870.1.1458_Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas_sp.
>EF165046.1.1380_Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Pantoea;Pantoea_sp._CBMB55

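I need one file per species, containing all of the matching sequences.

The code above works, but in 16 hours it got through fewer than 2,000 species.

I'd like to run it in parallel to speed things up. Any other tips for making the search more efficient are also welcome.

Thanks

3 Answers:

Answer 0: (score: 2)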

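This was trickier than I first thought, because the matching lines have to go into separate files. If you get a chance, please post the performance you see. This solution can also be run in parallel: the species list file can be chunked and/or the FASTA file can be chunked and fed to parallel runs of the script.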
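The chunking idea can be sketched roughly as follows. This is only an illustration, not the answerer's script: the `chunk_demo` directory, the file names, and the tiny sample data are all made up, and the parallelism uses plain background jobs rather than GNU parallel.

```shell
# Sketch: split the species list into chunks and run one grep per chunk in parallel.
# Directory, file names and sample data are hypothetical stand-ins.
set -eu
rm -rf chunk_demo && mkdir -p chunk_demo

# Tiny stand-ins for the species list and the one-line FASTA file.
printf '%s\n' Brachyspira_innocens Lampropedia_hyalina Allium_cepa Brachyspira_pulli \
    > chunk_demo/species.txt
cat > chunk_demo/db.fasta <<'EOF'
>EF164983.1.1433_Bacteria;Spirochaetae;Brachyspira;Brachyspira_innocens
ACGU-1
>AY291120.1.1477_Bacteria;Proteobacteria;Lampropedia;Lampropedia_hyalina
ACGU-2
>EF164984.1.1433_Bacteria;Spirochaetae;Brachyspira;Brachyspira_pulli
ACGU-3
EOF

# Two names per chunk for the demo; a real run would use thousands per chunk.
split -l 2 chunk_demo/species.txt chunk_demo/chunk.

# One background grep per chunk; each one scans the whole FASTA file once.
# grep may print "--" between non-adjacent groups of matches.
for chunk in chunk_demo/chunk.??; do
    grep -A1 -F -f "$chunk" chunk_demo/db.fasta > "$chunk.hits" &
done
wait    # block until every background grep has finished

cat chunk_demo/chunk.*.hits
```

Each chunk still reads the entire FASTA file, so how much this helps depends on the number of cores and on disk throughput. The per-chunk hit files would still have to be split into one file per species afterwards.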

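On an Intel Xeon E5 it takes about 1 minute to check 10,000 species against a 6 GB file of fake data. Increasing the species list to 100,000 was problematic even in chunks of 10,000, because I ran into disk problems with so many files being created and appended to in a single directory. The problems started once the species list went past 50,000; that number will differ on other systems. I modified the script to create 100 subdirectories and limit each directory to 1,000 files. That works well, and all 100,000 files are generated without having to chunk either the species list or the 6 GB data file.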
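The modified script with subdirectories is not shown in this answer, so the following is only a guess at how such bucketing might look. The `bucket_demo` directory, the `d000`-style subdirectory names, the `per_dir` variable, and the sample data are all assumptions; the species is taken from the last semicolon-delimited field, and the names are assumed to be safe to use as file names.

```shell
# Sketch: spread per-species output files over numbered subdirectories.
# All names and the sample data are hypothetical.
set -eu
rm -rf bucket_demo && mkdir -p bucket_demo

cat > bucket_demo/pairs.txt <<'EOF'
>id1;Bacteria;Brachyspira_innocens
ACGU-1
>id2;Bacteria;Lampropedia_hyalina
ACGU-2
>id3;Bacteria;Brachyspira_innocens
ACGU-3
>id4;Bacteria;Brachyspira_pulli
ACGU-4
EOF

# per_dir caps the number of files per subdirectory (2 here; 1000 in the answer's setup).
awk -F';' -v outdir=bucket_demo -v per_dir=2 '
/^>/ {
    species = $NF
    if (!(species in path)) {
        # first time this species is seen: assign it to the current bucket
        dir = sprintf("%s/d%03d", outdir, int(nfiles / per_dir))
        system("mkdir -p \"" dir "\"")
        path[species] = dir "/" species ".fasta"
        nfiles++
    }
    file = path[species]
    print >> file                              # header line
    if ((getline seq) > 0) print seq >> file   # sequence line that follows it
    close(file)                                # keep the number of open files low
}' bucket_demo/pairs.txt

find bucket_demo -name '*.fasta' | sort
```

Closing the file after every pair trades some speed for never holding more than one output file open at a time.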

Also, to give you an idea of how fast grep is: it takes 6 seconds to match 100,000 species in the 6 GB file.

# usage: pass the species list as $1; grep reads the FASTA data from standard input
specieslist=$1
nspecies=$(wc -l < "$specieslist")
echo -e "grep $nspecies species from $specieslist\n"
grep -A1 -F -f "$specieslist" |
awk '
# skip context marker
/^--$/{next}
# process pair of lines
# first line is matching species header line
# species is semicolon-delimited field 7 of first line
# (assumes exactly 7 ranks; deeper taxonomies put the species in the last field)
# second line is sequence - both lines are written to a file with sanitized species name
{
  split($0, flds, ";")
  species=flds[7]
  # gensub() is a gawk extension
  filekey=gensub(/\W/,".","g",species)
  file="fastaout." filekey
  if(!(filekey in outfiles))  {
    outfiles[filekey]=file
    printf("species \"%s\" outfile \"%s\" first match line %d: \"%s\"\n", species, file, NR, $0)
    print >file
  }
  getline; print >>file
  # close may be needed on systems where awk cannot juggle too many open files
  close(file)
}
'
outfiles=(fastaout.*)
noutfiles=${#outfiles[*]}
echo -e "\ncreated $noutfiles fastaout.* files"
head -5 fastaout*

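The output and the slightly modified test input are shown below. The species list contains a few actual matches, and the sequence lines of the FASTA file are prefixed with the lowercase species name to verify correctness and to avoid matching the species a second time.

Output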

$ head out.*
==> out.Brachyspira_innocens <==
brachyspira_innocens.1:-UC-U-G--G-U---------------------------
brachyspira_innocens.2:-UC-U-G--G-U---------------------------

==> out.Methanohalophilus_portucalensis_FDF-1 <==
methanohalophilus_portucalensis_fdf-1:-UC-U-G--G-U---------------------------

==> out.Pucciniomycotina <==
pucciniomycotina:-UC-U-G--G-U---------------------------

Species list

Allobacillus_halotolerans
Allochromatium_truperi
Allochromatium_vinosum
Methanohalophilus_portucalensis_FDF-1
Brachyspira_innocens
Pucciniomycotina

FASTA file

HP451749.6.1794_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota;Pucciniomycotina;Pucciniomycetes;Pucciniales;Pucciniaceae;Puccinia;Puccinia_triticina;.............................................................................
pucciniomycotina:-UC-U-G--G-U---------------------------
>EF164983.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_innocens
brachyspira_innocens.1:-UC-U-G--G-U---------------------------
>X96499.1.1810_Eukaryota;Archaeplastida;Chloroplastida;Charophyta;Phragmoplastophyta;Streptophyta;Embryophyta;Marchantiophyta;Jungermanniales;Calypogeia;Plagiochila_adiantoides
plagiochila_adiantoides:-UC-U-G--G-U---------------------------
>AB034906.1.1763_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota;Saccharomycotina;Saccharomycetes;Saccharomycetales;Saccharomycetaceae;Citeromyces;Citeromyces_siamensis
citeromyces_siamensis:-UC-U-G--G-U---------------------------
>AY290717.1.1208_Archaea;Euryarchaeota;Methanomicrobia;Methanosarcinales;Methanosarcinaceae;Methanohalophilus;Methanohalophilus_portucalensis_FDF-1
methanohalophilus_portucalensis_fdf-1:-UC-U-G--G-U---------------------------
>EF164984.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_pulli
brachyspira_pulli:-UC-U-G--G-U---------------------------
>AY291120.1.1477_Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Lampropedia;Lampropedia_hyalina
lampropedia_hyalina:-UC-U-G--G-U---------------------------
>EF164987.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_alvinipulli
brachyspira_alvinipulli:-UC-U-G--G-U---------------------------
>JQ838073.1.1461_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS01
streptomyces_sp._qls01:-UC-U-G--G-U---------------------------
>EF164989.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_alvinipulli
brachyspira_alvinipulli:-UC-U-G--G-U---------------------------
>JQ838076.1.1460_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS04
streptomyces_sp._qls04:-UC-U-G--G-U---------------------------
>AB035584.1.1789_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Tremellomycetes;Tremellales;Trichosporonaceae;Trichosporon;Trichosporon_debeurmannianum
trichosporon_debeurmannianum:-UC-U-G--G-U---------------------------
>JQ838080.1.1457_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS11
streptomyces_sp._qls11:-UC-U-G--G-U---------------------------
>EF165015.1.1527_Bacteria;Firmicutes;Clostridia;Clostridiales;Family_XI;Tepidimicrobium;Clostridium_sp._PML3-1
clostridium_sp._pml3-1:-UC-U-G--G-U---------------------------
>U85867.1.1424_Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Alteromonadaceae;Marinobacter;Marinobacter_sp.
Marinobacter_sp.:-UC-U-G--G-U---------------------------
>EF165044.1.1398_Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Methylobacteriaceae;Methylobacterium;Methylobacterium_sp._CBMB38
methylobacterium_sp._cbmb38:-UC-U-G--G-U---------------------------
>U85870.1.1458_Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas_sp.
pseudomonas_sp.:-UC-U-G--G-U---------------------------
>EF165046.1.1380_Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Pantoea;Pantoea_sp._CBMB55
pantoea_sp._cbmb55:-UC-U-G--G-U---------------------------
>EF164983.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_innocens
brachyspira_innocens.2:-UC-U-G--G-U---------------------------

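Answer 1: (score: 1)

I would probably reach for something other than shell + grep for this, though parallelizing it would certainly be a big first step. Here is a bash 4 + awk solution: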

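# read all 100,000 species names into a shell array
mapfile -t species <species1

# turn the names into a single big regular expression
# (names are used unescaped, so a "." in a name matches any character;
# a regex built from 100,000 names may also be too long to pass on the command line)
regex=${species[0]}$(printf '|%s' "${species[@]:1}")

# use awk to print the matching lines into the respective files
# (the redirection target is parenthesized so it parses the same way in every awk)
awk -F';' '($7 ~ /^('"$regex"')$/) { print > ("Species." $7) }' bigfile.txt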

Answer 2: (score: 0)

I haven't tried awk against data this large, but I was a bit curious:

$ cat > spec.awk
NR==FNR {       # the species file
    species[$0] # read to an array "species"
    next
} 
# below, if the beginning of the last column (until first space) is found from
# the species array, write the whole row ($0) to a file named by the species.
match($NF,/^[^ ]+/) && (beginningof=substr($NF,RSTART,RLENGTH)) && (beginningof in species) {
    print $0 > beginningof
}

$ awk -F';' -f spec.awk spec large

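It reads all the species into an array, then starts matching the beginning of the last column of the large file (the string up to the first space). If a match is found, it writes the whole row (print $0; replace $0 with $NF if you only want the last column) to a file named after the species. This may produce 100,000 files in a single directory. The exact test file below produced three files, 1, 2 and 3: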

$ cat large
foo;1 asd
bar;2 asd
foobar;3 asd
foo;1 asd
bar;2 asd
foo;bar;3 asd
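For the two-line FASTA layout from the question, the same array lookup might be adapted along these lines. This is a sketch under assumptions, not a tested solution for the 24 GB file: it takes the species to be the last semicolon-delimited field of each header, assumes the sequence sits on the single line after the header, and uses a made-up `lookup_demo` directory with shortened sample headers.

```shell
# Sketch: one pass over the FASTA file with a hash lookup on the last header field.
# Directory, file names and sample data are hypothetical.
set -eu
rm -rf lookup_demo && mkdir -p lookup_demo

printf '%s\n' Brachyspira_innocens Lampropedia_hyalina > lookup_demo/species.txt
cat > lookup_demo/db.fasta <<'EOF'
>EF164983.1.1433_Bacteria;Spirochaetae;Brachyspira;Brachyspira_innocens
ACGU-1
>X96499.1.1810_Eukaryota;Archaeplastida;Chloroplastida;Calypogeia;Plagiochila_adiantoides
ACGU-2
>AY291120.1.1477_Bacteria;Proteobacteria;Lampropedia;Lampropedia_hyalina
ACGU-3
>EF164983.1.1433_Bacteria;Spirochaetae;Brachyspira;Brachyspira_innocens
ACGU-4
EOF

awk -F';' -v outdir=lookup_demo '
NR == FNR { wanted[$0]; next }     # first file: remember every species name
/^>/ && ($NF in wanted) {          # header whose last field is a wanted species
    file = outdir "/Species." $NF
    print >> file                              # header line
    if ((getline seq) > 0) print seq >> file   # sequence line that follows it
    close(file)                                # avoid running out of open files
}' lookup_demo/species.txt lookup_demo/db.fasta

head lookup_demo/Species.*
```

Because the lookup is an exact string comparison, a name such as Marinobacter_sp. would not also pick up longer names that merely contain it, which a substring grep would do.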

Disclaimer: don't cut, paste, and run code from the internet if you don't understand how it works.