Extract each line of text that follows a specific character and put the results into a table using Postgres

Date: 2017-04-21 09:53:40

Tags: postgresql fasta

My text looks like this:

>Sequenz: Test 1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG

>Sequenz 2 1234 Organism: Treponema
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG

>Sequenz 3
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG

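There is not necessarily an empty line between the text blocks, and the sequence part (such as 'MTEITAAMVKELRESTGAGM') may span a varying number of lines. The only thing that is certain is that each header line starts with >.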

I would like to get a table like this:

HEADER 
----------
Sequenz: Test 1 
----------
Sequenz 2 1234 Organism: Treponema
----------
Sequenz 3

I tried:

SELECT regexp_matches(regexp_split_to_table( 'text from above', '\n>'),'([A-Z,a-z,0-9]+\s)');

which results in

HEADER
----------
Sequenz
----------
Sequenz
----------
Sequenz

SELECT regexp_split_to_table('text from above', '[\\\n>+(.)\\\n]+');

which results in

HEADER
----------
----------
Sequenz: Test 1 
----------
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
----------
----------
Sequenz 2 1234 Organism: Treponema 
----------
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
----------
----------
Sequenz 3 
----------
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG

1 answer:

Answer 0: (score: 1)

Try this:

SELECT split_part(regexp_split_to_table(trim(leading '>' from '>Sequenz: Test 1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG

>Sequenz 2 1234 Organism: Treponema
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG

>Sequenz 3
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG'), E'>'),E'\n', 1) AS res

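Without trim(), splitting on '>' also yields an empty first row (the empty string before the first '>'). If you want to keep that empty first row, remove the trim() call.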

Demo: http://rextester.com/LQXY98290
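
Since the goal is to get the headers into a table, here is a minimal sketch of an alternative that matches the header lines directly instead of splitting the text. It relies on regexp_matches() with the flags 'g' (return every match) and 'n' (newline-sensitive matching). The table names fasta_raw and fasta_headers and the column names content and header are made up for illustration and do not come from the question.

```sql
-- Hypothetical staging table holding the raw FASTA text (name is illustrative).
CREATE TEMP TABLE fasta_raw (content text);

INSERT INTO fasta_raw (content) VALUES (
'>Sequenz: Test 1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG

>Sequenz 2 1234 Organism: Treponema
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
>Sequenz 3
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG');

-- 'g' returns one row per match instead of only the first match.
-- 'n' makes ^ and $ match at line boundaries and keeps . from matching a newline,
-- so the capture group holds everything after '>' up to the end of that line.
CREATE TEMP TABLE fasta_headers AS
SELECT m[1] AS header
FROM fasta_raw,
     regexp_matches(content, '^>(.*)$', 'gn') AS m;

SELECT header FROM fasta_headers;
```

This does not depend on empty lines between blocks or on how many lines each sequence spans, because only lines that begin with '>' are matched. If the headers may carry trailing whitespace, wrapping m[1] in rtrim() would strip it.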