I need to parse the Newick format, which is used for trees. It looks like a series of parentheses, commas, and letters denoting nodes:
(A,B,(C,D)E)F
Or, put another way:
(,(((,(,)),),))
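(,) elements denote nodes that share the same parent. For my purpose (measuring the path length between two leaves), I need to find such nested elements.
So my question is: how do I match the same number of repetitions of different symbols?
For example, I want to match AB patterns in the string: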
CCCAAABBACCCABCCAAABBBBBBACCCCCABBBABBCCAABB
The regular expression should return: ['AABB','AB','AAABBB','AB','AB','AABB']
The number of repetitions differs each time, so A{n}B{n} does not work.
Thanks.
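Answer 0 (score: 1)
Your problem is a classic example of something regular expressions cannot do.
The "Use of the lemma" section of http://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages proves that the language "a^n b^n" is not regular (and therefore cannot be recognized by a true regular expression).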
With a regular expression, you can only build a pattern for a given maximum n, and for large n that pattern may take a long time to evaluate.
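As a rough illustration of the bounded approach, the sketch below builds one alternative A{n}B{n} for each n up to a chosen limit. The helper name buildBoundedPattern and the limit of 5 are assumptions made for this example; they do not come from the question.

```javascript
// Hypothetical helper: builds /A{5}B{5}|A{4}B{4}|...|A{1}B{1}/g for maxN = 5.
function buildBoundedPattern(maxN) {
  const alternatives = [];
  for (let n = maxN; n >= 1; n--) {
    alternatives.push(`A{${n}}B{${n}}`);
  }
  return new RegExp(alternatives.join('|'), 'g');
}

const text = 'CCCAAABBACCCABCCAAABBBBBBACCCCCABBBABBCCAABB';
// String.prototype.match with a global regex returns all matches, or null if none.
const boundedMatches = text.match(buildBoundedPattern(5)) || [];
console.log(boundedMatches);
// → [ 'AABB', 'AB', 'AAABBB', 'AB', 'AB', 'AABB' ]
```

The pattern grows linearly with maxN, and any run longer than the limit is silently missed, so this only helps when an upper bound on n is known in advance.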
PS. Your problem can be solved with a formal grammar (http://en.wikipedia.org/wiki/Formal_grammar) or a counter automaton (http://en.wikipedia.org/wiki/Counter_automaton).
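A minimal sketch in the spirit of the counter idea, assuming the goal is the output listed in the question: count a run of A, count the run of B that follows it, and keep the smaller of the two counts. The function name matchEqualRuns and its parameters are made up for this illustration, and this is a plain scan with two counters rather than a formal counter automaton.

```javascript
// Hypothetical helper: for every run of `first` directly followed by a run of
// `second`, emit first^n second^n where n is the smaller of the two run lengths.
function matchEqualRuns(text, first = 'A', second = 'B') {
  const results = [];
  let i = 0;
  while (i < text.length) {
    // Count consecutive `first` symbols starting at i.
    let firstCount = 0;
    while (text[i] === first) { firstCount++; i++; }
    if (firstCount === 0) { i++; continue; } // not at the start of a run; skip one character

    // Count the consecutive `second` symbols that follow immediately.
    let secondCount = 0;
    while (text[i] === second) { secondCount++; i++; }

    const n = Math.min(firstCount, secondCount);
    if (n > 0) results.push(first.repeat(n) + second.repeat(n));
  }
  return results;
}

console.log(matchEqualRuns('CCCAAABBACCCABCCAAABBBBBBACCCCCABBBABBCCAABB'));
// → [ 'AABB', 'AB', 'AAABBB', 'AB', 'AB', 'AABB' ]
```

Unlike a bounded pattern, this has no upper limit on n and runs in a single pass over the string.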
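Answer 1 (score: 0)
Example: simplifying parentheses
Let O = opening parenthesis, C = closing parenthesis, X = some expression in between.
In some cases we can only simplify if the number of O's on the left equals the number of C's on the right.
We can still use RegExp here:
Put the regex in a loop and match only one O-and-C pair at a time, re-applying it to the output of the previous iteration until the string is fully reduced.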
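const EXAMPLES =
`
OC
OOOOCOCCOCXCC
X
OCX
OXC
OOXCC
OOXXCC
OXOXXCC
OOXCOXXCC
OXCOXC
OXXCOXXC
`;

// True when a capture group took part in the match (not null/undefined).
const is = v => null != v;

for (let input of EXAMPLES.trim().split(/\r?\n/g)) {
  console.log('input', JSON.stringify(input));
  let replaced;
  // Re-apply the regex to its own output until a pass makes no replacement.
  do {
    replaced = false;
    // Alternatives: an empty pair "OC", a single wrapped "OXC", or a whole
    // line of the form O..X..C whose outer O and C runs can be trimmed.
    input = input.replace(/(?:OC|O(X)C|^(O{1,99})(X{1,99}(?:O{1,99}X{1,99}){0,99})(C{1,99})$)/gm, (...m) => {
      replaced = true;
      let out, clen;
      console.log(' m', JSON.stringify(m));
      if (is(m[2])) {
        // Whole-line case: drop as many outer O/C pairs as both sides have.
        clen = Math.min(m[2].length, m[4].length);
        out = m[2].slice(clen) + m[3] + m[4].slice(clen);
      }
      else {
        // Pair cases: keep only the captured X, if there is one.
        out = m.slice(1, -2).map(v => null == v ? '' : v).join('');
      }
      console.log(' replaceWith', JSON.stringify(out));
      return out;
    });
  } while (replaced);
  console.log('output', JSON.stringify(input) + '\n');
}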
Output:
input "OC"
m ["OC",null,null,null,null,0,"OC"]
replaceWith ""
output ""
input "OOOOCOCCOCXCC"
m ["OC",null,null,null,null,3,"OOOOCOCCOCXCC"]
replaceWith ""
m ["OC",null,null,null,null,5,"OOOOCOCCOCXCC"]
replaceWith ""
m ["OC",null,null,null,null,8,"OOOOCOCCOCXCC"]
replaceWith ""
m ["OC",null,null,null,null,2,"OOOCXCC"]
replaceWith ""
m ["OOXCC",null,"OO","X","CC",0,"OOXCC"]
replaceWith "X"
output "X"
input "X"
output "X"
input "OCX"
m ["OC",null,null,null,null,0,"OCX"]
replaceWith ""
output "X"
input "OXC"
m ["OXC","X",null,null,null,0,"OXC"]
replaceWith "X"
output "X"
input "OOXCC"
m ["OOXCC",null,"OO","X","CC",0,"OOXCC"]
replaceWith "X"
output "X"
input "OOXXCC"
m ["OOXXCC",null,"OO","XX","CC",0,"OOXXCC"]
replaceWith "XX"
output "XX"
input "OXOXXCC"
m ["OXOXXCC",null,"O","XOXX","CC",0,"OXOXXCC"]
replaceWith "XOXXC"
output "XOXXC"
input "OOXCOXXCC"
m ["OXC","X",null,null,null,1,"OOXCOXXCC"]
replaceWith "X"
m ["OXOXXCC",null,"O","XOXX","CC",0,"OXOXXCC"]
replaceWith "XOXXC"
output "XOXXC"
input "OXCOXC"
m ["OXC","X",null,null,null,0,"OXCOXC"]
replaceWith "X"
m ["OXC","X",null,null,null,3,"OXCOXC"]
replaceWith "X"
output "XX"
input "OXXCOXXC"
output "OXXCOXXC"
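Stripped of the X handling, the same repeat-until-stable idea can be sketched on the question's Newick strings: delete every adjacent "()" pair, feed the result back in, and stop when a pass changes nothing. The names reducePairs and isBalanced are hypothetical, and the sketch only checks that the parentheses nest correctly; it does not build the tree.

```javascript
// Hypothetical helper: repeatedly removes innermost "()" pairs until stable.
function reducePairs(input) {
  let current = input;
  let previous;
  do {
    previous = current;
    current = current.replace(/\(\)/g, '');
  } while (current !== previous);
  return current;
}

// Hypothetical helper: keeps only the parentheses, then reduces them.
// An empty result means every "(" found its matching ")".
function isBalanced(newick) {
  const parensOnly = newick.replace(/[^()]/g, '');
  return reducePairs(parensOnly) === '';
}

console.log(isBalanced('(A,B,(C,D)E)F'));   // → true
console.log(isBalanced('(,(((,(,)),),))')); // → true
console.log(isBalanced('(A,(B,C)'));        // → false
```

Each pass is an ordinary regular expression; the counting that a single pattern cannot do is carried by the loop.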