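Using Ruta, I want to create an annotation for each user from a table-like structure, with the following fields as features.
The fields to extract are UserName, UserId, DateReport, Dt.Open, Limit, UType, Balance, Term, Due, status, Source.
For the first instance, userName = BWERTIEW LOAN SERVICING, userId = 621234546899999.
Sample input: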
User Date Dt.Open Limit Balance Due Mor 30+ 60+ 90+ status
Report DLA UType Terms source
J J BWERTIEW LOAN SERVICING 02/12 01/08 $12345 $21234 $0 11 0 0 0 M1
621234546899999 07/09 MIG 120 $217 XA/TQ/EE
SECOND VALUE
J J COST/HOME 01/10 02/14 $21235 $123456 $0 16 0 0 0 L1
12345634561001 01/15 AUTO 071 $126 XX/TQ/EF
C C GLOBAL 10/11 11/12 $3400 $123 $0 31 0 0 0 QA
51234"°"*** 03/13 RIV MRN $111 XP/TU/EF
Late Dates: 1/15-30, 1/11-12
CHANGE
J J QWERTYFIN 01/09 01/11 $12345 $0 $0 23 0 0 0 Mi
1234558200189 01/13 MIG 130 $0 XP/TU/EF
BY ANOTHER; USER DEACTIVATED
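Approach:
Create Cells wherever the gap between tokens is at least 2 spaces wide.
Mark startIndicator by checking a regular expression, and create a DataRow for the data between two start indicators.
The last row cannot be annotated as a DataRow this way, because no startIndicator follows it.
Iterate over all DataRows and pick out the data based on its distance from the startIndicator.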
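The steps above can be sketched outside Ruta to check the logic, including the last record, which has no following start indicator. This is a plain Python illustration, not the Ruta solution. The names `split_cells`, `group_rows`, `CELL_SPLIT` and `START` are hypothetical, and the column spacing in `lines` is assumed, since the sample input as shown does not preserve the original gaps.

```python
import re

# Assumption: columns are separated by runs of two or more spaces.
CELL_SPLIT = re.compile(r" {2,}")
# Assumption: a record starts with two single letters followed by a name,
# mirroring the start-indicator idea (e.g. "J J COST/HOME").
START = re.compile(r"^[A-Za-z] [A-Za-z] [A-Z]")

def split_cells(line):
    """Split one line into cells on gaps of two or more spaces."""
    return [c for c in CELL_SPLIT.split(line.strip()) if c]

def group_rows(lines):
    """Group lines into records. A record runs from one start line up to
    (not including) the next start line, or to the end of the input."""
    rows, current = [], None
    for line in lines:
        if START.match(line):
            if current is not None:
                rows.append(current)
            current = [split_cells(line)]
        elif current is not None:
            current.append(split_cells(line))
    if current is not None:
        rows.append(current)  # the last record has no following start line
    return rows

# Values taken from the sample input; the spacing is hypothetical.
lines = [
    "J J COST/HOME   01/10   02/14   $21235   $123456   $0   16   0 0 0   L1",
    "12345634561001   01/15   AUTO   071   $126   XX/TQ/EF",
    "C C GLOBAL   10/11   11/12   $3400   $123   $0   31   0 0 0   QA",
    '51234"°"***   03/13   RIV   MRN   $111   XP/TU/EF',
]
rows = group_rows(lines)
```

Closing the final record at end of input is the piece that a rule requiring a second start indicator cannot express on its own.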
Code:
PACKAGE uima.mq;
TYPESYSTEM utils.PlainTextTypeSystem;
ENGINE utils.PlainTextAnnotator;
DECLARE Header;
DECLARE ColumnDelimiter;
DECLARE Cell(INT column);
DECLARE Keyword (STRING label);
DECLARE Entry(Keyword keyword);
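// Run the plain-text annotator to create Line and Paragraph annotations.
EXEC(PlainTextAnnotator, {Line,Paragraph});
// Make whitespace visible to the rules below.
ADDRETAINTYPE(WS);
Line{->TRIM(WS)};
Paragraph{->TRIM(WS)};
// A run of 2 to 100 spaces separates two columns.
SPACE[2,100]{-PARTOF(ColumnDelimiter) -> ColumnDelimiter};
// Everything between delimiters within a line becomes a Cell.
Line -> {ANY+{-PARTOF(Cell),-PARTOF(ColumnDelimiter) -> Cell};};
REMOVERETAINTYPE(WS);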
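// A Cell that looks like two single letters followed by a name starts a record.
DECLARE StartIndicator;
Cell{REGEXP("[A-z]\\s[A-z]\\s[\\w*\\s]+[\\w\\/]*") -> MARK(StartIndicator)};
// A DataRow spans from one StartIndicator up to the next one.
DECLARE DataRow;
(StartIndicator #){-> DataRow} StartIndicator;
// Cells holding dates such as 02/12.
DECLARE DateIndicator;
Cell{REGEXP("[0-9]{2}\\/[0-9]{2}") -> MARK(DateIndicator)};
// Cells holding amounts such as $12345.
DECLARE CurrencyIndicator;
Cell{REGEXP("\\$[0-9]+") -> MARK(CurrencyIndicator)};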
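The three regular expressions in the script can be checked in isolation. The Python sketch below uses the same patterns with the Ruta string escapes undone, and assumes a pattern has to match the whole cell text, hence `fullmatch`. `STRICT` and `is_start` are hypothetical names added for illustration. One thing it shows is that the range `[A-z]` also accepts the ASCII characters between `Z` and `a`, such as `_` and `^`, which may not be intended.

```python
import re

# The start-indicator pattern from the script, Ruta string escapes undone.
ORIGINAL = re.compile(r"[A-z]\s[A-z]\s[\w*\s]+[\w\/]*")
# Hypothetical tighter variant: only letters in the two leading positions.
STRICT = re.compile(r"[A-Za-z]\s[A-Za-z]\s[\w*\s]+[\w/]*")

# The date and currency patterns from the script.
DATE = re.compile(r"[0-9]{2}/[0-9]{2}")
CURRENCY = re.compile(r"\$[0-9]+")

def is_start(pattern, cell):
    """True if the whole cell text matches the given start pattern."""
    return pattern.fullmatch(cell) is not None

# Cells from the sample input that should start a record.
starts = [c for c in ("J J BWERTIEW LOAN SERVICING", "J J COST/HOME",
                      "C C GLOBAL", "SECOND VALUE", "621234546899999")
          if is_start(ORIGINAL, c)]

# "_" lies between "Z" and "a" in ASCII, so [A-z] accepts it.
loose_hit = is_start(ORIGINAL, "_ _ X")
strict_hit = is_start(STRICT, "_ _ X")
```

With these inputs, `starts` keeps the three cells that begin with two single letters, and `loose_hit` is true while `strict_hit` is false.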
Query