Extract/parse a large multifasta into alignments using a table (csv, tsv)

Time: 2018-04-12 18:25:54

Tags: bioinformatics biopython fasta bioconductor

I often need to parse a large multifasta into individual multifasta files for downstream alignment, using a table generated by other programs/code.

I have a large multifasta (seq.fa):

>sp1_gene1
ATTAC
>sp1_gene2
CCATTA
...
>sp2_gene1
ATTAC
>sp1_gene2
TCGAGT

I also have a tsv file with the locus name in the first column and a list of headers in the subsequent columns. The number of fields per row may differ, because a species may lack a given gene. But I could easily add a column header for each species and fill in NA or similar for missing data. The table (genes.tsv):

geneA    sp1_gene3    sp2_gene1
geneB    sp1_gene5    sp2_gene7
...

I want to use the gene table to create individual multifasta files (ideally named after the first column), each containing the headers and sequences listed in its row.

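I am familiar with bash (awk, grep, sed) and am still learning R and Python for bioinformatics. I initially split the table into individual files in bash, converted the fasta to csv, and then used grep and join, but it got really messy and did not always work. Any suggestions for scripts or packages that can do this? Thanks!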

1 answer:

Answer 0: (score: 1)

This solved the problem for me:

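# Map each FASTA header (without the leading ">") to its sequence.
# Assumes every record is exactly two lines: a header, then one sequence line.
sequences = {}

with open("seq.fa") as my_fasta:
    for header in my_fasta:
        seq = next(my_fasta)
        sequences[header[1:].rstrip()] = seq.rstrip()

# Write one FASTA file per row of the table, named after the first column.
with open("genes.tsv") as my_tsv:
    for line in my_tsv:
        splitted_line = line.split()
        if not splitted_line:
            continue  # skip blank lines
        with open("/your/output/Dir/" + splitted_line[0] + ".fa", "w") as gene_writer:
            for gene in splitted_line[1:]:
                if gene in sequences:
                    gene_writer.write(">" + gene + "\n")
                    gene_writer.write(sequences[gene] + "\n")
                else:
                    print(gene, "in tsv file but not in fasta")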

Breaking it down:

sequences = {}

with open("seq.fa") as my_fasta:
    for header in my_fasta:
        seq = next(my_fasta)
        sequences[header[1:].rstrip()] = seq.rstrip()

This creates a dictionary, sequences, with the gene names as keys and the sequences as values. Like this:

{'sp1_gene1': 'ATTAC', 'sp1_gene2': 'TCGAGT', 'sp2_gene1': 'ATTAC'}
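The loop above reads exactly one sequence line per header, so it only works when no sequence is wrapped over several lines. If your FASTA may contain wrapped sequences or blank lines, a minimal sketch of a more tolerant parser could look like the following. The function name read_fasta and the toy input are hypothetical, introduced only for illustration; as in the dictionary above, a repeated header overwrites the earlier record.

```python
def read_fasta(lines):
    """Parse FASTA lines into {header: sequence}, joining wrapped sequence lines."""
    sequences = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            name = line[1:]
            sequences[name] = ""  # a repeated header overwrites the earlier record
        elif name is not None:
            sequences[name] += line  # append this chunk of a wrapped sequence
    return sequences


# Toy input (made up): the first sequence is wrapped over two lines.
example = [">sp1_gene1\n", "ATT\n", "AC\n", "\n", ">sp2_gene1\n", "ATTAC\n"]
print(read_fasta(example))  # {'sp1_gene1': 'ATTAC', 'sp2_gene1': 'ATTAC'}
```

With a real file, the function can take the open file handle directly, for example `with open("seq.fa") as fh: sequences = read_fasta(fh)`.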

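The second part of the code iterates over the TSV file, creates a new .fa file for each line, and writes the listed sequences to that file in fasta format.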
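The question mentions padding missing entries with NA or similar. As a rough sketch, the writing step could be wrapped in a function that skips such placeholders and returns the headers it could not find instead of printing them. The name write_gene_fastas, the missing parameter, and the toy sequences below are all assumptions made for this example, not part of the original code.

```python
import os
import tempfile


def write_gene_fastas(table_lines, sequences, out_dir, missing=("NA",)):
    """Write one <name>.fa per table row; return headers not found in sequences."""
    not_found = []
    for line in table_lines:
        fields = line.split()  # whitespace-delimited; use line.split(",") for csv
        if not fields:
            continue  # skip blank rows
        name, genes = fields[0], fields[1:]
        with open(os.path.join(out_dir, name + ".fa"), "w") as out:
            for gene in genes:
                if gene in missing:
                    continue  # placeholder for a species lacking this gene
                if gene in sequences:
                    out.write(">" + gene + "\n" + sequences[gene] + "\n")
                else:
                    not_found.append(gene)
    return not_found


# Toy data; the sequences here are made up for illustration.
seqs = {"sp1_gene3": "ATTAC", "sp2_gene1": "CCATTA"}
rows = ["geneA\tsp1_gene3\tsp2_gene1\n", "geneB\tNA\tsp2_gene7\n"]
out_dir = tempfile.mkdtemp()
missing_headers = write_gene_fastas(rows, seqs, out_dir)
print(missing_headers)  # ['sp2_gene7']
```

Returning the unmatched headers makes it easier to check the table against the fasta afterwards. Note that a row whose headers are all missing still produces an empty .fa file in this sketch.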

Hope this helps. :)