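I often need to parse a large multifasta into several individual multifastas for downstream alignment, using tables generated by other programs/code.
I have a large multifasta (seq.fa):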
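>sp1_gene1
ATTAC
>sp1_gene2
CCATTA
...
>sp2_gene1
ATTAC
>sp1_gene2
TCGAGT
I also have a TSV file with the locus name in the first column and a list of headers in the subsequent columns. The number of fields per row may differ, because a species may lack a given gene, but I could easily add a column header for each species and fill in NA (or similar) for missing data. The table (genes.tsv):
geneA sp1_gene3 sp2_gene1
geneB sp1_gene5 sp2_gene7
...
I would like to use the gene table to create individual multifastas (ideally named after the first column), each containing the listed headers and their sequences.
I am familiar with bash (awk, grep, sed) and am still learning R and Python for bioinformatics. I initially split the table into individual files in bash, converted the FASTA to CSV, then grepped and joined, but it got really messy and never worked reliably. Any suggestions for a script or package that can do this? Thanks!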
Answer 0 (score: 1)
This solved my problem:
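sequences = {}
# Read the FASTA, assuming each record is exactly two lines: header, then sequence.
with open("seq.fa") as my_fasta:
    for header in my_fasta:
        seq = next(my_fasta)
        # Drop the leading ">" and the trailing newline from the header.
        sequences[header[1:].rstrip()] = seq.rstrip()
# Write one FASTA file per TSV row, named after the first column.
with open("genes.tsv") as my_tsv:
    for line in my_tsv:
        splitted_line = line.split()
        if not splitted_line:
            continue  # skip blank lines
        with open("/your/output/Dir/" + splitted_line[0] + ".fa", "w") as gene_writer:
            for gene in splitted_line[1:]:
                if gene in sequences:
                    gene_writer.write(">" + gene + "\n")
                    gene_writer.write(sequences[gene] + "\n")
                else:
                    print(gene, "in tsv file but not in fasta")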
Breaking it down:
sequences = {}
with open("seq.fa") as my_fasta:
    for header in my_fasta:
        seq = next(my_fasta)
        sequences[header[1:].rstrip()] = seq.rstrip()
This creates a dictionary, sequences, with the gene names as keys and the sequences as values. Like this:
{'sp1_gene1': 'ATTAC', 'sp1_gene2': 'TCGAGT', 'sp2_gene1': 'ATTAC'}
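Note that sp1_gene2 appears only once in this dictionary even though seq.fa lists that header twice: the later record overwrites the earlier one. The loop above also assumes every sequence sits on a single line. A minimal sketch of a parser that accepts wrapped sequences and reports repeated headers might look like the following; the helper name `read_fasta` and its return shape are assumptions for illustration, not part of the original answer.

```python
def read_fasta(lines):
    """Parse FASTA lines into ({header: sequence}, [repeated headers]).

    Hypothetical helper. Sequences may span several lines. A repeated
    header replaces the earlier record (the same outcome as the plain
    dictionary assignment above) and is also recorded in the second list.
    """
    sequences = {}
    duplicates = []
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]
            if header in sequences:
                duplicates.append(header)
            sequences[header] = ""  # start (or restart) this record
        elif header is not None:
            sequences[header] += line  # append a wrapped sequence line
    return sequences, duplicates


example = [">sp1_gene1", "ATT", "AC", ">sp1_gene2", "CCATTA", ">sp1_gene2", "TCGAGT"]
sequences, duplicates = read_fasta(example)
```

Because the function takes any iterable of lines, an open file handle such as the one from `open("seq.fa")` can be passed in directly.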
The second part of the code iterates over the TSV file, creates a new .fa file for each row, and writes the sequences to that file in FASTA format.
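The question mentions padding short rows with NA for missing data. As a rough sketch of how the writing step could skip such placeholders and collect the headers it could not find, the function below wraps the same logic; the name `write_gene_fastas`, the `missing_marker` parameter, and the returned list are all assumptions made for this example.

```python
from pathlib import Path


def write_gene_fastas(tsv_lines, sequences, out_dir, missing_marker="NA"):
    """Write one <locus>.fa per TSV row; return (locus, header) pairs not found.

    Hypothetical helper: `sequences` is a {header: sequence} dictionary and
    `missing_marker` is the assumed placeholder for absent genes.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    missing = []
    for line in tsv_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank rows
        locus, headers = fields[0], fields[1:]
        with open(out_dir / f"{locus}.fa", "w") as handle:
            for name in headers:
                if name == missing_marker:
                    continue  # placeholder, nothing to write
                if name in sequences:
                    handle.write(f">{name}\n{sequences[name]}\n")
                else:
                    missing.append((locus, name))
    return missing
```

Returning the missing pairs instead of printing them makes it easier to check, after a run, which table entries had no matching FASTA record.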
Hope this helps. :)