Is sequencing possible? - uima ruta

Time: 2016-06-14 12:21:16

Tags: uima ruta

Is it possible to do sequencing in UIMA Ruta? For example:

Input file:

some text 
Fig 1.1
Table 1.1
Fig 1.2
some text
Pic 1.2
Table 1.2
some text
Table 1.3
Pic 1.3
some text
Fig 1.4
some text
Table 1.4
some text
Table 1.5
Fig 1.6
Box 1.1
Fig 1.5

How can I find the missing figure (Fig 1.3)?

2 answers:

Answer 0 (score: 0)

Here is an example of how this can be done with UIMA Ruta 2.5.0.

Input text:

some text 
Fig 1.1
some text
Pic 1.2
some text
Pic 1.3
some text
Fig 1.4
some text

Rule script:

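DECLARE FigureInd;
DECLARE FigureMention (INT chapter, INT section);

// Helper action: create a FigureMention with the given chapter and section.
ACTION FM(INT chap, INT sect) = CREATE(FigureMention, "chapter" = chap, "section" = sect);

// Annotate every literal "Fig" as a figure indicator.
"Fig"-> FigureInd;

// An indicator followed by "<number>.<number>" becomes a FigureMention.
INT c, s;
(FigureInd NUM{PARSE(c)} PERIOD NUM{PARSE(s)}){-> FM(c,s)};

// Mark the span between two consecutive mentions of the same chapter
// whose section numbers are not adjacent.
DECLARE FigMissing;
f1:FigureMention #{-> FigMissing} f2:FigureMention
    {f1.chapter == f2.chapter, f1.section < (f2.section - 1)};

// Inside such a span, accept any token followed by the next expected
// number as a FigureMention and advance the expected number.
INT pc, ps;
f:FigureMention{-> pc=f.chapter, ps=f.section} 
    FigMissing->{
    (ANY @NUM{PARSE(c)} PERIOD NUM{PARSE(s)}){c==pc,s==ps+1-> FM(c,s), pc=c, ps=s};
    };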

The following FigureMention annotations are created:

Fig 1.1
Pic 1.2
Pic 1.3
Fig 1.4
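
For readers who do not know Ruta, the logic of the script can be sketched in plain Python. This is only an illustration under assumptions, not Ruta code and not part of the original answer: the function name `find_mentions`, the regular expression, and the `SAMPLE` constant are hypothetical, and the sketch works on "label number.number" strings instead of UIMA annotations.

```python
import re

# Hypothetical pattern: a label word, a space, then "<chapter>.<section>".
LABELED = re.compile(r"(\w+) (\d+)\.(\d+)")

SAMPLE = (
    "some text \nFig 1.1\nsome text\nPic 1.2\nsome text\n"
    "Pic 1.3\nsome text\nFig 1.4\nsome text\n"
)

def find_mentions(text, indicator="Fig"):
    """Return (label, chapter, section) tuples accepted as figure mentions."""
    candidates = [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in LABELED.finditer(text)
    ]
    # Step 1: every candidate with the explicit indicator is a mention.
    fig_idx = [i for i, cand in enumerate(candidates) if cand[0] == indicator]
    accepted = set(fig_idx)
    # Step 2: between two consecutive indicator mentions of the same chapter
    # with a numbering gap, accept candidates that continue the sequence.
    for a, b in zip(fig_idx, fig_idx[1:]):
        _, c1, s1 = candidates[a]
        _, c2, s2 = candidates[b]
        if c1 == c2 and s1 < s2 - 1:
            pc, ps = c1, s1  # last accepted chapter and section
            for i in range(a + 1, b):
                _, c, s = candidates[i]
                if c == pc and s == ps + 1:
                    accepted.add(i)
                    ps = s
    return [candidates[i] for i in sorted(accepted)]

for label, chapter, section in find_mentions(SAMPLE):
    print(f"{label} {chapter}.{section}")
```

Like the rules, the sketch only fills a gap with candidates that carry exactly the next expected number, so unrelated numbered items inside the gap are ignored.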

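The solution for UIMA Ruta 2.4.0 is very similar, but that version does not allow features to be accessed directly through annotation label expressions. The feature values need to be stored in additional variables, and the boolean checks have to be applied after the variables have been set. Here is the solution for UIMA Ruta 2.4.0: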

DECLARE FigureInd;
DECLARE FigureMention (INT chapter, INT section);

ACTION FM(INT chap, INT sect) = CREATE(FigureMention, "chapter" = chap, "section" = sect);

"Fig"-> FigureInd;

INT c, s;
(FigureInd NUM{PARSE(c)} PERIOD NUM{PARSE(s)}){-> FM(c,s)};

DECLARE FigMissing;
INT c1,c2,s1,s2;
(FigureMention<-{FigureMention{-> ASSIGN(c1, FigureMention.chapter), ASSIGN(s1, FigureMention.section)};} 
    #{-> FigMissing} 
    FigureMention<-{FigureMention{-> ASSIGN(c2, FigureMention.chapter), ASSIGN(s2, FigureMention.section)};}) 
    {c1 == (c2), s1 < (s2 - 1)};

INT pc, ps;
f:FigureMention{-> pc=FigureMention.chapter, ps=FigureMention.section} 
    FigMissing->{
    (ANY @NUM{PARSE(c)} PERIOD NUM{PARSE(s)}){c==(pc),s==(ps+1)-> FM(c,s), pc=c, ps=s};
    };

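(Disclaimer: I am a developer of UIMA Ruta)

Answer 1 (score: 0)

The following script creates an annotation that holds the minimum and maximum of the missing numbers, in UIMA Ruta 2.4.0: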

DECLARE FigureInd;
DECLARE FigureMention (INT chapter, INT section);
DECLARE FigureMissing (INT minChapter, INT minSection, INT maxChapter, INT maxSection);

ACTION Mention(INT chap, INT sect) = CREATE(FigureMention, "chapter" = chap, "section" = sect);
ACTION Missing(INT minc, INT mins, INT maxc, INT maxs) = CREATE(FigureMissing, "minChapter" = minc, "minSection" = mins, "maxChapter" = maxc, "maxSection" = maxs);

"Fig"-> FigureInd;

INT c, s;
(FigureInd NUM{PARSE(c)} PERIOD NUM{PARSE(s)}){-> Mention(c,s)};

DECLARE FigMissing;
INT c1,c2,s1,s2;
(FigureMention<-{FigureMention{-> ASSIGN(c1, FigureMention.chapter), ASSIGN(s1, FigureMention.section)};} 
    #{-> Missing(c1,s1+1,c2,s2-1)} 
    FigureMention<-{FigureMention{-> ASSIGN(c2, FigureMention.chapter), ASSIGN(s2, FigureMention.section)};}) 
    {c1 == (c2), s1 < (s2 - 1)};

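UIMA Ruta has no loops over boolean expressions (such as while), only over existing annotations. This makes it more complicated to create a separate annotation for each missing figure at the same offsets. It can be done with a recursive BLOCK, however. The script in this answer instead creates one annotation that defines a range of missing numbers.

For the example text of the question, two FigureMissing annotations are created: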

FigureMissing
- begin: 41
- end: 112
- minChapter: 1
- minSection: 3
- maxChapter: 1
- maxSection: 3

FigureMissing
- begin: 123
- end: 165
- minChapter: 1
- minSection: 5
- maxChapter: 1
- maxSection: 5
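
The feature values of these two annotations can be checked with a small plain-Python sketch of the same comparison. It is an illustration under assumptions, not Ruta code: `missing_ranges`, the regular expression, and `QUESTION_TEXT` are hypothetical names, and the sketch reproduces only the four min/max features, not the begin/end offsets.

```python
import re

# Hypothetical pattern for explicit figure mentions such as "Fig 1.4".
FIG = re.compile(r"Fig (\d+)\.(\d+)")

QUESTION_TEXT = "\n".join([
    "some text ", "Fig 1.1", "Table 1.1", "Fig 1.2", "some text",
    "Pic 1.2", "Table 1.2", "some text", "Table 1.3", "Pic 1.3",
    "some text", "Fig 1.4", "some text", "Table 1.4", "some text",
    "Table 1.5", "Fig 1.6", "Box 1.1", "Fig 1.5",
])

def missing_ranges(text):
    """Return (minChapter, minSection, maxChapter, maxSection) per gap."""
    figs = [(int(m.group(1)), int(m.group(2))) for m in FIG.finditer(text)]
    ranges = []
    # Compare each figure mention with the next one in document order.
    for (c1, s1), (c2, s2) in zip(figs, figs[1:]):
        if c1 == c2 and s1 < s2 - 1:
            ranges.append((c1, s1 + 1, c2, s2 - 1))
    return ranges

print(missing_ranges(QUESTION_TEXT))
```

Because only neighbouring mentions are compared, the gap between Fig 1.4 and Fig 1.6 is reported even though Fig 1.5 appears later in the text.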

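If the second FigureMissing should not be created, an additional rule can remove it again based on the existing FigureMentions. This would of course be simpler if separate FigureMissing annotations had been created, for example using a BLOCK.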
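
The effect of such a cleanup step can be sketched in plain Python. This is an illustration under assumptions, not Ruta code: `truly_missing` is a hypothetical helper, the ranges are given as (minChapter, minSection, maxChapter, maxSection) tuples, and each range is assumed to stay within one chapter, which the chapter check in the script guarantees.

```python
def truly_missing(ranges, mentions):
    """Expand ranges into single figures and drop those mentioned anywhere."""
    mentioned = set(mentions)
    missing = []
    for min_chapter, min_section, _max_chapter, max_section in ranges:
        for section in range(min_section, max_section + 1):
            if (min_chapter, section) not in mentioned:
                missing.append((min_chapter, section))
    return missing

# Values taken from the question's text: the two reported ranges and
# the (chapter, section) pairs of all explicit "Fig" mentions.
ranges = [(1, 3, 1, 3), (1, 5, 1, 5)]
mentions = [(1, 1), (1, 2), (1, 4), (1, 6), (1, 5)]
print(truly_missing(ranges, mentions))
```

With the question's values only Fig 1.3 remains, because Fig 1.5 is mentioned after Fig 1.6.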

Disclaimer: I am a developer of UIMA Ruta