Want to remove markup from an annotation - UIMA Ruta

Time: 2016-08-27 03:55:17

Tags: uima ruta

I am using the P tag (from the Html Annotator) as PASSAGE, and I want the annotation to ignore the markup it contains.

SCRIPT:

//-------------------------------------------------------------------
// SPECIAL SQUARE HYPHEN PARENTHESIS
//-------------------------------------------------------------------
DECLARE LParen, RParen;
SPECIAL{REGEXP("[(]") -> MARK(LParen)};
SPECIAL{REGEXP("[)]") -> MARK(RParen)};

DECLARE LSQParen, RSQParen;
SPECIAL{REGEXP("[\\[]") -> MARK(LSQParen)};
SPECIAL{REGEXP("[\\]]") -> MARK(RSQParen)};

DECLARE LANGLEBRACKET,RANGLEBRACKET;
SPECIAL{REGEXP("<")->MARK(LANGLEBRACKET)};
AMP{REGEXP("&lt;")->MARK(LANGLEBRACKET)};
SPECIAL{REGEXP(">")->MARK(RANGLEBRACKET)};
AMP{REGEXP("&gt;")->MARK(RANGLEBRACKET)};

DECLARE LBracket,RBracket;

(LParen|LSQParen|LANGLEBRACKET){->MARK(LBracket)};
(RParen|RSQParen|RANGLEBRACKET){->MARK(RBracket)};


DECLARE PASSAGE,TESTPASSAGE;

       "<a name=\"para(.+?)\">(.*?)</a>"->2=PASSAGE;

 RETAINTYPE(WS); // or RETAINTYPE(SPACE, BREAK,...);
 PASSAGE{-> TRIM(WS)};
 RETAINTYPE;

  PASSAGE{->MARK(TESTPASSAGE)};



DECLARE TagContent,PassageFirstToken,InitialTag;
LBracket ANY+? RBracket{-PARTOF(TagContent)->MARK(TagContent,1,3)}; 


 BLOCK(foreach)PASSAGE{}
{
Document{->MARKFIRST(PassageFirstToken)};
}   
TagContent{CONTAINS(PassageFirstToken),-PARTOF(InitialTag)->MARK(InitialTag)};


BLOCK(foreach)PASSAGE{}
{
InitialTag  ANY+{->SHIFT(PASSAGE,2,2)};

}

Sample input:

<p class="Normal"><a name="para1"><h1><b>On the Insert tab, the galleries include items that are designed to coordinate with the overall look of your document. </b></a></p>

<p class="Normal"><a name="para2"><aus>On the Insert tab, the galleries include items that are designed to coordinate with the overall look of your document.</a></p>

<p class="Normal"><a name="para3">On the Insert tab, the galleries include items that are designed to coordinate with the overall look of your document.</a></p>

<p class="Normal"><a name="para4">On the Insert tab, the galleries include items that are designed to coordinate with the overall look of your document. </a></p>

<p class="Normal"><a name="para5">On the Insert tab, the <span>galleries</span> include items that are designed to coordinate with the overall look of your document.</a></p>

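The output contains 5 PASSAGE annotations but only 2 TESTPASSAGE annotations. Why are there fewer TESTPASSAGE annotations? Also, no InitialTag annotation is created.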

I have attached an image of the output annotations.

2 answers:

Answer 0 (score: 2)

  //-------------------------------------------------------------------
// SPECIAL SQUARE HYPHEN PARENTHESIS
//-------------------------------------------------------------------
DECLARE LParen, RParen;
SPECIAL{REGEXP("[(]") -> MARK(LParen)};
SPECIAL{REGEXP("[)]") -> MARK(RParen)};

DECLARE LSQParen, RSQParen;
SPECIAL{REGEXP("[\\[]") -> MARK(LSQParen)};
SPECIAL{REGEXP("[\\]]") -> MARK(RSQParen)};

DECLARE LANGLEBRACKET,RANGLEBRACKET;
SPECIAL{REGEXP("<")->MARK(LANGLEBRACKET)};
AMP{REGEXP("&lt;")->MARK(LANGLEBRACKET)};
SPECIAL{REGEXP(">")->MARK(RANGLEBRACKET)};
AMP{REGEXP("&gt;")->MARK(RANGLEBRACKET)};

DECLARE LBracket,RBracket;

(LParen|LSQParen|LANGLEBRACKET){->MARK(LBracket)};
(RParen|RSQParen|RANGLEBRACKET){->MARK(RBracket)};


DECLARE PASSAGE,TESTPASSAGE;

       "<a name=\"para(.+?)\">(.*?)</a>"->2=PASSAGE;

 RETAINTYPE(WS); // or RETAINTYPE(SPACE, BREAK,...);
 PASSAGE{-> TRIM(WS)};
 RETAINTYPE;

  PASSAGE{->MARK(TESTPASSAGE)};



DECLARE TagContent,PassageFirstToken,InitialTag;
LBracket ANY+? RBracket{-PARTOF(TagContent)->MARK(TagContent,1,3)}; 


 BLOCK(foreach)PASSAGE{}
{
Document{->MARKFIRST(PassageFirstToken)};
}   
TagContent{CONTAINS(PassageFirstToken),-PARTOF(InitialTag)->MARK(InitialTag)};


BLOCK(foreach)PASSAGE{}
{
InitialTag  ANY+{->SHIFT(PASSAGE,2,2)};

}

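Answer 1 (score: 2)

When I reproduce the given example, I get 5 PASSAGE annotations and 3 TESTPASSAGE annotations (one for each of the last three PASSAGE annotations). The other two PASSAGE annotations do not receive a TESTPASSAGE annotation because they start with a MARKUP annotation, which is invisible by default and therefore makes the complete annotation invisible. To avoid this problem, you can either make MARKUP visible or trim the markup off the PASSAGE annotations (is that actually the main question?). Just extend the rule with the TRIM action: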

RETAINTYPE(WS, MARKUP);
PASSAGE{-> TRIM(WS, MARKUP)};
RETAINTYPE;
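
For context, here is a minimal sketch of how that fix might sit in the script from the question. It only rearranges rules already shown above and assumes the rest of the script stays unchanged:

```
DECLARE PASSAGE, TESTPASSAGE;

// Create a PASSAGE from the second capturing group (rule taken from the question).
"<a name=\"para(.+?)\">(.*?)</a>" -> 2 = PASSAGE;

// Make whitespace and markup visible only while trimming,
// so that TRIM can cut them off both ends of each PASSAGE.
RETAINTYPE(WS, MARKUP);
PASSAGE{-> TRIM(WS, MARKUP)};
RETAINTYPE;

// A trimmed PASSAGE no longer starts with an invisible MARKUP token,
// so this rule should now be able to match every PASSAGE.
PASSAGE{-> MARK(TESTPASSAGE)};
```

Note that TRIM only removes markup at the beginning and end of a PASSAGE. Markup in the middle, such as the span tags in the fifth sample paragraph, stays inside the annotation's span.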

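There are no InitialTag annotations because there are no TagContent annotations, and there are no TagContent annotations because the example contains no LBracket annotations.

Nevertheless, you could rewrite some of the rules: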

PASSAGE{->MARKFIRST(PassageFirstToken)};

(LBracket # RBracket){-PARTOF(TagContent)-> TagContent}; 
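
Put together with the declarations and the InitialTag rule from the question, the rewritten rules might look like the sketch below. This is only a rearrangement of rules already shown, under the assumption that LBracket and RBracket are declared and marked as in the original script:

```
DECLARE TagContent, PassageFirstToken, InitialTag;

// The wildcard # replaces "ANY+?", and the implicit action
// "-> TagContent" replaces the explicit MARK(TagContent,1,3).
(LBracket # RBracket){-PARTOF(TagContent) -> TagContent};

// MARKFIRST applied directly to PASSAGE replaces the BLOCK(foreach) construct.
PASSAGE{-> MARKFIRST(PassageFirstToken)};

// Unchanged from the question: a TagContent that contains the first
// token of a passage becomes the InitialTag.
TagContent{CONTAINS(PassageFirstToken), -PARTOF(InitialTag) -> MARK(InitialTag)};
```

As explained above, the sample input still yields no LBracket annotations, so these rules are not expected to produce any InitialTag annotation for it.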

DISCLAIMER: I am a developer of UIMA Ruta