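Regex to match different symbols the same number of times in different parts of a text

Time: 2012-09-21 06:41:22

Tags: python regex bioinformatics

I need to parse the Newick format, which is used to describe trees. It looks like a series of parentheses, commas, and letters that denote nodes: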

(A,B,(C,D)E)F

Or, in other words:

(,(((,(,)),),))

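A (,) element denotes nodes that share the same parent node. For my purpose (measuring the path length between two leaves), I need to find such nested elements.

So my question is: how can I match different symbols the same number of times?

For example, I want to match the AB pattern in this string: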

CCCAAABBACCCABCCAAABBBBBBACCCCCABBBABBCCAABB

The regex should return: ['AABB','AB','AAABBB','AB','AB','AABB']

The number of repetitions differs each time, so A{n}B{n} does not work.

Thanks.

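2 answers:

Answer 0 (score: 1)

Your problem is a classic example of something regular expressions cannot do.

The section "Use of the lemma" in http://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages proves that the language a^n b^n is not regular (and therefore cannot be recognized by a regular expression).

With regular expressions, you can only build a regex for a given maximum n. But the expression for a large n may take a long time to evaluate.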
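To illustrate the bounded-n idea, here is a minimal Python sketch. The helper name bounded_anbn_pattern and the limit of 3 are my own assumptions, not something from the question. It builds one alternative per value of n, largest first, so the pattern grows with the chosen maximum, and a run longer than the maximum would only be matched up to that maximum.

```python
import re

def bounded_anbn_pattern(max_n):
    """Build a regex that matches A{n}B{n} for every n from max_n down to 1.

    Hypothetical helper for illustration; the pattern has max_n alternatives.
    """
    alternatives = [f"A{{{n}}}B{{{n}}}" for n in range(max_n, 0, -1)]
    return re.compile("|".join(alternatives))

text = "CCCAAABBACCCABCCAAABBBBBBACCCCCABBBABBCCAABB"
pattern = bounded_anbn_pattern(3)  # assumes no balanced run is longer than 3

# findall returns the whole match because the pattern has no capture groups
matches = pattern.findall(text)
print(matches)  # ['AABB', 'AB', 'AAABBB', 'AB', 'AB', 'AABB']
```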

PS. Your problem can be solved with a formal grammar (http://en.wikipedia.org/wiki/Formal_grammar) or a counter automaton (http://en.wikipedia.org/wiki/Counter_automaton).
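In the spirit of the counting approach, one possible Python sketch lets the regex find each run of A's followed by B's and does the counting in ordinary code. This is not a formal counter automaton, and the function name balanced_ab_runs is an assumption made for illustration.

```python
import re

def balanced_ab_runs(text):
    """Return A^n B^n for each run of A's directly followed by B's,
    where n is the smaller of the two run lengths."""
    results = []
    for m in re.finditer(r"(A+)(B+)", text):
        # The counting that a regular expression cannot do happens here.
        n = min(len(m.group(1)), len(m.group(2)))
        results.append("A" * n + "B" * n)
    return results

sample = "CCCAAABBACCCABCCAAABBBBBBACCCCCABBBABBCCAABB"
print(balanced_ab_runs(sample))
# ['AABB', 'AB', 'AAABBB', 'AB', 'AB', 'AABB']
```

Unlike a pattern built for a fixed maximum n, this has no upper limit on the run length.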

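Answer 1 (score: 0)

Example: simplifying parentheses

Let O = opening parenthesis, C = closing parenthesis, X = some expression in between.

In some cases we can only simplify if the number of O's on the left equals the number of C's on the right.

We can still use RegExp here:

Put the rx in a loop and match only a single O and C pair at a time, re-running it on the output of the previous iteration until the input is fully reduced.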

const EXAMPLES =
`
OC
OOOOCOCCOCXCC
X
OCX
OXC
OOXCC
OOXXCC
OXOXXCC
OOXCOXXCC
OXCOXC
OXXCOXXC
`;

const is = v=>null!=v;
for (let input of EXAMPLES.trim().split(/\r?\n/g)) {
    console.log('input', JSON.stringify(input));
    let replaced;
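    // Re-apply the regex to its own output until nothing matches any more.
    do {
        replaced = false;
        // Alternatives: "OC" (empty pair), "O(X)C" (pair around a single X, group 1),
        // or a whole line made of leading O's, a body, and trailing C's (groups 2, 3, 4).
        input = input.replace(/(?:OC|O(X)C|^(O{1,99})(X{1,99}(?:O{1,99}X{1,99}){0,99})(C{1,99})$)/gm, (...m) => {
            replaced = true;
            let out, clen;
            console.log(' m', JSON.stringify(m));
            if (is(m[2])) {
                // Whole-line case: drop as many outer O/C pairs as both ends have in common.
                clen = Math.min(m[2].length, m[4].length);
                out = m[2].slice(clen) + m[3] + m[4].slice(clen);
            }
            else {
                // "OC" or "OXC": keep only the captured X, if there is one.
                out = m.slice(1,-2).map(v=>null==v?'':v).join('');
            }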
            console.log(' replaceWith', JSON.stringify(out));
            return out;
        });
    } while (replaced);
    console.log('output', JSON.stringify(input)+'\n')
}

Output:

input "OC"
 m ["OC",null,null,null,null,0,"OC"]
 replaceWith ""
output ""

input "OOOOCOCCOCXCC"
 m ["OC",null,null,null,null,3,"OOOOCOCCOCXCC"]
 replaceWith ""
 m ["OC",null,null,null,null,5,"OOOOCOCCOCXCC"]
 replaceWith ""
 m ["OC",null,null,null,null,8,"OOOOCOCCOCXCC"]
 replaceWith ""
 m ["OC",null,null,null,null,2,"OOOCXCC"]
 replaceWith ""
 m ["OOXCC",null,"OO","X","CC",0,"OOXCC"]
 replaceWith "X"
output "X"

input "X"
output "X"

input "OCX"
 m ["OC",null,null,null,null,0,"OCX"]
 replaceWith ""
output "X"

input "OXC"
 m ["OXC","X",null,null,null,0,"OXC"]
 replaceWith "X"
output "X"

input "OOXCC"
 m ["OOXCC",null,"OO","X","CC",0,"OOXCC"]
 replaceWith "X"
output "X"

input "OOXXCC"
 m ["OOXXCC",null,"OO","XX","CC",0,"OOXXCC"]
 replaceWith "XX"
output "XX"

input "OXOXXCC"
 m ["OXOXXCC",null,"O","XOXX","CC",0,"OXOXXCC"]
 replaceWith "XOXXC"
output "XOXXC"

input "OOXCOXXCC"
 m ["OXC","X",null,null,null,1,"OOXCOXXCC"]
 replaceWith "X"
 m ["OXOXXCC",null,"O","XOXX","CC",0,"OXOXXCC"]
 replaceWith "XOXXC"
output "XOXXC"

input "OXCOXC"
 m ["OXC","X",null,null,null,0,"OXCOXC"]
 replaceWith "X"
 m ["OXC","X",null,null,null,3,"OXCOXC"]
 replaceWith "X"
output "XX"

input "OXXCOXXC"
output "OXXCOXXC"
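
For readers following the python tag, the loop-until-nothing-changes idea can be sketched in Python as well. This is a simplified variant and not a port of the JavaScript above: it strips any innermost O...C pair that encloses only X's, so its results differ for some inputs (for example, it reduces OXXCOXXC to XXXX). The names PAIR and reduce_pairs are assumptions made for illustration.

```python
import re

# An innermost pair: an O and a C with nothing but X's between them.
PAIR = re.compile(r"O(X*)C")

def reduce_pairs(s):
    """Strip innermost O...C pairs repeatedly until no pair is left."""
    while True:
        # subn returns the new string and the number of replacements made
        s, count = PAIR.subn(r"\1", s)
        if count == 0:
            return s

for example in ["OC", "OOOOCOCCOCXCC", "OOXCC", "OOXC"]:
    print(repr(example), "->", repr(reduce_pairs(example)))
```

Each pass matches only one nesting level, so the nesting depth is tracked by the loop and never by the regex itself. Any O or C left in the result means the input was unbalanced.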