Html Annotator, Html Converter of UIMA Ruta

Time: 2016-05-11 06:22:16

Tags: uima ruta

Can anyone briefly explain the HtmlAnnotator, the HtmlConverter and the TEIViewWriter with some examples? I want to create the annotations in the initial view.

Waiting for answers.

Main script:

 PACKAGE uima.ruta.example;
 SCRIPT uima.ruta.example.Html;
 Document{-> EXEC(Html)};
 WORDLIST JOURNALNAMELIST='JournalName.txt';
 WORDLIST CITYPUBLIST='CITYPUB.txt';
 DECLARE JOURNALNAME;
 DECLARE CITYPUB;
 Document{ -> MARKFAST(JOURNALNAME, JOURNALNAMELIST)};
 Document{ -> MARKFAST(CITYPUB, CITYPUBLIST)};
 DECLARE Reference;
 "<a name=para(.+?)>(.+?)</a>"-> 2=Reference;
 DECLARE FirstToken, LastToken;

 BLOCK(InRef) Reference{}
 {
 ANY{POSITION(Reference,1) -> MARK(FirstToken)};
 Document{-> MARKLAST(LastToken)};
 }
 DECLARE FIRSTWORD;
 FirstToken PERIOD CW {->MARK(FIRSTWORD)};

Html script:

 PACKAGE uima.ruta.example;
 ENGINE utils.HtmlAnnotator;
 ENGINE utils.HtmlConverter;
 ENGINE utils.HtmlViewWriter;
 TYPESYSTEM utils.HtmlTypeSystem;
 TYPESYSTEM utils.SourceDocumentInformation;
 Document{-> EXEC(HtmlAnnotator)};
 Document { -> CONFIGURE(HtmlConverter, "inputView" = "_InitialView","outputView" = "plain"),
 EXEC(HtmlConverter)};
 Document{ -> CONFIGURE(HtmlViewWriter, "inputView" = "plain","outputView" = "_InitialView", "output" = "E:/ruta-2.4.0-source-release/ruta-2.4.0/example-projects/TextRulerExample/output"),
 EXEC(HtmlViewWriter)};

Sample Html input file: (manually converted to html by changing the extension)

<html>
<head>
 <meta http-equiv=Content-Type content="text/html; charset=windows-1252">
 <meta name=Generator content="Microsoft Word 14 (filtered)">
 <style>
 <!--
/* Font Definitions */
 @font-face
 {font-family:Calibri;
 panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
 p.MsoNormal, li.MsoNormal, div.MsoNormal
 {margin-top:0in;
 margin-right:0in;
 margin-bottom:10.0pt;
 margin-left:0in;
 line-height:115%;
 font-size:11.0pt;
 font-family:"Calibri","sans-serif";}
span.DAZZLEFN
 {mso-style-name:DAZZLEFN;}
span.DAZZLELN
 {mso-style-name:DAZZLELN;
 color:#92D050;}
.MsoChpDefault
 {font-family:"Calibri","sans-serif";}
.MsoPapDefault
 {margin-bottom:10.0pt;
 line-height:115%;}
@page WordSection1
 {size:8.5in 11.0in;
 margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
 {page:WordSection1;}
-->
</style>

</head>

<body lang=EN-US>

<div class=WordSection1>

<p class=MsoNormal><a name=para0>REFERENCES</a></p>

 <p class=MsoNormal><a name=para1>1. Lawrence RA. A        review of the
 medical benefits and contraindications to breastfeeding in the United    States
 [Internet] . Arlington (VA): National Center for Education in Maternal and
 Child Health; 1997 Oct [cited 2000 Apr 24]. p. 40. Available from:
 www.ncemch.org/pubs/PDFs/Welcometojungle.pdf.</a></p>

 <p class=MsoNormal><a name=para2>2. Shishido A.  Retraction notice:
 Effect of platinum compounds on murine lymphocyte mitogenesis [Retraction of
 Alsabti EA, Ghalib ON, Salem MH. In: Jpn J Med Biol 1979 Apr; 32(2):53-65].      Jpn
 J Med Sci Biol 1980 Aug;33(4):235-237.</a></p>

 <p class=MsoNormal><a name=para3>3. Leist TP,  Zinkernagel RM.
 Effects of treatment with IL-2 receptor specific monoclonal antibody in mice
 [letter] [Retraction of Leist TP, Kohler M, Eppler M, Zinkernagel RM. In: J
 Immunol 1989 Jul 15; 143(2): 628-32]. J Immunol 1990 Apr 1;144(7):2847.</a>  </p>

 <p class=MsoNormal><a name=para4>4. Alsabti EA, Ghalib     ON, Salem MH.
 Effect of platinum compounds on murine lymphocyte mitogenesis [Retracted by
 Shishido A. In: Jpn J Med Sci Biol 1980 Aug; 33(4):235-7]. Jpn J Med Sci  Biol
 1979 Apr;32(2):53-65.</a></p>

 <p class=MsoNormal><a name=para5>5. Tidy JA, Parry GC, Ward P,
 Coleman DV, Peto J, Malcolm AD, Farrell PJ. High rate of papillomavirus type 16
 infection in cytologically normal cervices [letter] [Retracted by Tidy J,
 Farrell PJ. In: Lancet 1989 Dec 23-30:2(8678-8679):1535]. Lancet 1989 Feb   25;1(8635):434.</a></p>

 <p class=MsoNormal><a name=para6>6. Magni F, Rossoni G,  Berti F.
 BN-52021 protects guinea-pig from heard anaphylaxis. Pharm Res Commun 1988
 Dec;20 Suppl 5:75-78.</a></p>

 <p class=MsoNormal><a name=para7>7. Garvia EE, DeHaven ED. An
 experimental analysis of response acquisition and elimination with positive
 reinforcers. Behav Neuropsychiatry 1975 a April-1976 May;7(1-12):71-78.</a>  </p>

 <p class=MsoNormal><a name=para8>8. Mueller FO,   Schindler RD. Annual
 survey of football injury research 1931-1985. [place unknown]: American
 Football Coaches Assn; 1986. 24 p.</a></p>

 <p class=MsoNormal><a name=para9>9. Stern, Michael P.   National
 Institute of Arthritis, Diabetes, and Digestive and Kidney Diseases.   Diabetes
 in America: diabetes data compiled 1984.. [Bethesda (MD)]: The Institute; 1985
 Aug. Diabetes in Hispanic Americans. Chapter 9. (NIH publication; no. 86- 1468).</a></p>

 <p class=MsoNormal><a name=para10>10. Vivian, Valerie L,      editor. Child
 abuse and neglect: a medical community response. 1st AMA National   Conference on
 Child Abuse and Neglect; 1984 March 30-June 31; Chicago. Chicago: American
 Medical Association; 1985. 256 p.</a></p>

 <p class=MsoNormal><a name=para11>11. Popper, Hans, et al.,   editors.
 Structural carbohydrates in the liver: proceedings of the 34th Falk   Symposium;
 1982 oct 12-19; Basil, Switzerland.Boston: MTB Press; 1983. 701 p.</a></p>

 <p class=MsoNormal><a name=para12></a>&nbsp;</p>

 </div>

 </body>

 </html>

1 answer:

Answer 0 (score: 0)

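Note that your example script does not contain the TEIViewWriter you mention. The problem is the same, however.

Unfortunately, the example script contains an error:

The line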

Document{ -> CONFIGURE(ViewWriter, "inputView" = "plain",...

should read

Document{ -> CONFIGURE(HtmlViewWriter, "inputView" = "plain",

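... and then the NPE disappears. If the input text cannot be parsed by the HtmlParser, another exception may occur, which leaves the XMI file with a missing Sofa. Wrapping the text may help here.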

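The files HtmlConverter.ruta and TEIConverter.ruta are good examples of how these components are used. The HtmlAnnotator creates annotations for HTML and XML tags/elements. The HtmlConverter removes all HTML/XML tags, stores the resulting text in a new view, and recalculates the offsets of the annotations. The TEIViewWriter is just a ViewWriter with a specific type system: it copies a given view into a new CAS and stores it. Together, these components can convert TEI/Html/XML text into plain text that carries annotations for the XML tags.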
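To make the division of labour between the first two components more concrete, here is a minimal Python sketch of the idea. It is not the UIMA Ruta implementation. The names `Annotation`, `annotate_elements` and `convert_to_plain` are hypothetical and chosen only for illustration, and the tag matching is a simple regular expression rather than a real HTML parser. The sketch assumes that an element annotation spans from its opening tag to its closing tag, and that removing the tags shifts every annotation to the matching position in the plain text.

```python
import re
from dataclasses import dataclass


@dataclass
class Annotation:
    type_name: str  # element name, e.g. "p" or "a"
    begin: int      # offset of the first covered character
    end: int        # offset just past the last covered character


# Matches opening and closing tags. Comments and doctype declarations are
# not matched, so in this sketch they would stay in the converted text.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def annotate_elements(html):
    """HtmlAnnotator-like step: one annotation per element in the source text."""
    stack, annotations = [], []
    for m in TAG.finditer(html):
        name = m.group(2).lower()
        if not m.group(1):
            stack.append((name, m.start()))
        elif any(n == name for n, _ in stack):
            # Discard unclosed elements (e.g. <meta>) above the matching opener.
            while True:
                open_name, begin = stack.pop()
                if open_name == name:
                    break
            annotations.append(Annotation(name, begin, m.end()))
    return annotations


def convert_to_plain(html, annotations):
    """HtmlConverter-like step: drop the tags and remap annotation offsets."""
    offset_map = []  # offset_map[i] = plain-text offset for source offset i
    plain = []
    pos = 0
    for m in TAG.finditer(html):
        for ch in html[pos:m.start()]:
            offset_map.append(len(plain))
            plain.append(ch)
        # Every character of a removed tag maps to the current plain offset.
        offset_map.extend([len(plain)] * (m.end() - m.start()))
        pos = m.end()
    for ch in html[pos:]:
        offset_map.append(len(plain))
        plain.append(ch)
    offset_map.append(len(plain))  # end-of-document offset

    mapped = [
        Annotation(a.type_name, offset_map[a.begin], offset_map[a.end])
        for a in annotations
    ]
    return "".join(plain), mapped


if __name__ == "__main__":
    source = "<p><a name=para1>1. Lawrence RA.</a></p>"
    text, mapped = convert_to_plain(source, annotate_elements(source))
    for a in mapped:
        print(a.type_name, repr(text[a.begin:a.end]))
```

In the sketch the source string plays the role of the input view and the returned plain text the role of the output view ("plain" in the script from the question). The real HtmlConverter does more than this, for example it offers configuration parameters that this sketch does not model.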

The documentation contains more information, for example about the configuration parameters.

Disclaimer: I am a developer of UIMA Ruta.