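I am trying to move legal documents from legacy SGML files into a database. Using regular expressions in Java, I have had good luck so far. However, I have run into a small problem: the labels used for each level of a document are not standard from one file to the next. For example, the most common labeling is: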
(<numeric>)
(<alpha>)
(<ROMAN>)
(<ALPHA>)
Example: (1)(a)(I)(A)
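However, other documents use variations, and some may omit the parentheses. My current algorithm has a hard-coded regex to match each element at each level, but I need a way to set the label type for each level dynamically as I work through the document.
Has anyone run into a problem like this? Does anyone have any suggestions?
Thanks in advance.
Edit
Here are the regexes I use to parse out the different items: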
Section: ^<tab>(<b>)?\d{1,4}(\.\d+)?-((\d{1,4}(\.\d+)?)(-|\.)?){3}
SubSection: \.?\s*(<\/b>|<tab>|^)\s*\(\d+(\.\d+)?\)\s+($|<b>|[A-Z"]|\([a-z](.\d+)?\)\s*(\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)\s*(\([A-Z](.\d+)?\))?)?\s*.)
Paragraph: (^|<tab>|\s+|\(\d+(\.\d+)?\)\s+)\([a-z](.\d+)?\)(\s+$|\s+<b>|\s+[A-Z"]|\s*\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)(\([A-Z](.\d+)?\))?\s*[A-Z"]?)
SubParagraph: (\)|<tab>|<\/b>)\s*\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)\s+($|[A-Z"<]|\([A-Z](.\d+)?\)\s*[A-Z"])
SubSubParagraph: (<tab>|\)\s*)\([A-Z](.\d+)?\)\s+([A-Z"]|$)
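Here is some sample text, which I left out earlier. Although the ultimate source of the data is SGML, what I am parsing is slightly different: apart from style tags, it is more or less plain text.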
<tab><b>SECTION 5.</b> In Colorado Revised Statutes, 13-5-142, <b>amend</b> (1)
introductory portion, (1)(b), and (3)(b)(II) as follows:
<tab><b>13-5-142. National instant criminal background check system - reporting.</b>
(1) On and after March 20, 2013, the state court administrator shall send electronically
the following information to the Colorado bureau of investigation created pursuant to
section 24-33.5-401, referred to in this section as the "bureau":
<tab>(b) The name of each person who has been committed by order of the court to the
custody of the office of behavioral health in the department of human services pursuant
to section 27-81-112 or 27-82-108; and
<tab>(3) The state court administrator shall take all necessary steps to cancel a record
made by the state court administrator in the national instant criminal background check
system if:
<tab>(b) No less than three years before the date of the written request:
<tab>(II) The period of commitment of the most recent order of commitment or
recommitment expired, or a court entered an order terminating the person's incapacity or
discharging the person from commitment in the nature of habeas corpus, if the record in
the national instant criminal background check system is based on an order of
commitment to the custody of the office of behavioral health in the department of human
services; except that the state court administrator shall not cancel any record pertaining to
a person with respect to whom two recommitment orders have been entered pursuant to
section 27-81-112 (7) and (8), or who was discharged from treatment pursuant to section
27-81-112 (11) on the grounds that further treatment is not likely to bring about
significant improvement in the person's condition; or
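Answer 0 (score: 1)
Your statement of the problem is vague, so the only possible answer is a general approach. I have dealt with converting imprecisely formatted documents like this one.
A tool from computer science that can help is the state machine. It is appropriate if you can detect (for example, with a regex) that the format is changing to a new convention. That detection changes the state, which in this case is equivalent to the translator to use on the current and subsequent chunks of text. The state remains in effect until the next state change. The overall algorithm looks like this: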
translator = DEFAULT
while (chunks of input remain) {
    chunk = GetNextChunkOfInput // a line, paragraph, etc.
    new_translator = ScanChunkForStateChange(chunk, translator)
    if (new_translator != null) translator = new_translator // found a state change!
    print(translator.Translate(chunk)) // use the translator on the chunk
}
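Within this framework, designing the translators and the state-change predicates is a tedious process. All you can hope to do is try something, inspect the output, fix the problems, and repeat until you can do no better. At that point you have probably found the maximum amount of structure in the input, and an algorithm that uses pattern matching alone (without trying to model semantics with AI) will not get you much further.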
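The loop above could be sketched in Java roughly as follows. This is only an illustration under assumptions: the two label conventions (PARENTHESIZED and DOTTED), their regexes, and the class and method names are hypothetical and are not taken from the question's documents. Real predicates would have to be tuned against the actual files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the state-machine loop; names and patterns are illustrative.
class LabelStateMachine {

    // Each translator stands for one labeling convention (assumed, not from the source files).
    enum Translator {
        PARENTHESIZED(Pattern.compile("^\\s*\\(([0-9]{1,4}|[a-z]{1,2}|[A-Z]{1,4})\\)\\s+")),
        DOTTED(Pattern.compile("^\\s*([0-9]{1,4}|[a-z]{1,2}|[A-Z]{1,4})\\.\\s+"));

        final Pattern label;

        Translator(Pattern label) {
            this.label = label;
        }

        // Splits a leading label from the chunk, or marks the chunk as plain text.
        String translate(String chunk) {
            Matcher m = label.matcher(chunk);
            if (!m.find()) {
                return "[text] " + chunk.trim();
            }
            return "[label=" + m.group(1) + "] " + chunk.substring(m.end()).trim();
        }
    }

    // Returns a different translator if the chunk fits another convention
    // and not the current one; returns null when no state change is detected.
    static Translator scanForStateChange(String chunk, Translator current) {
        if (current.label.matcher(chunk).find()) {
            return null;
        }
        for (Translator t : Translator.values()) {
            if (t != current && t.label.matcher(chunk).find()) {
                return t;
            }
        }
        return null;
    }

    // Runs the loop over all chunks and collects the translated output.
    static List<String> run(List<String> chunks) {
        Translator translator = Translator.PARENTHESIZED; // the DEFAULT state
        List<String> out = new ArrayList<>();
        for (String chunk : chunks) {
            Translator next = scanForStateChange(chunk, translator);
            if (next != null) {
                translator = next; // found a state change
            }
            out.add(translator.name() + " " + translator.translate(chunk));
        }
        return out;
    }
}
```

Keeping each convention's regex inside its translator means a new labeling style can be added as one more enum constant, without touching the loop.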
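Answer 1 (score: 0)
The text fragment you posted can be parsed and structured by an SGML parser using custom grammar rules in a DOCTYPE, also known as a DTD (assuming the <tab> in your example denotes an actual tab start-element tag rather than a TAB character). I stored your snippet in a file named data.ent, then created the following SGML file, doc.sgm, which references it: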
<!DOCTYPE doc [
<!ELEMENT doc O O (tab)+>
<!ELEMENT tab - O (((b,c?)|c),text)>
<!ELEMENT text O O (#PCDATA|b)+>
<!ELEMENT b - - (#PCDATA)>
<!ELEMENT c - - (#PCDATA)>
<!ENTITY data SYSTEM "data.ent">
<!ENTITY startc "<c>">
<!ENTITY endc "</c>">
<!SHORTREF intab "(" startc ")" endc>
<!USEMAP intab tab>
<!USEMAP #EMPTY text>
]>
&data
The result of parsing the data with these DTD rules (using osgmlnorm doc.sgm on the command line) is as follows:
<DOC>
<TAB>
<B>SECTION 5.</B>
<TEXT>In Colorado Revised Statutes, 13-5-142, <B>amend</B> (1)
introductory portion, (1)(b), and (3)(b)(II) as follows:
</TEXT>
</TAB>
<TAB>
<B>13-5-142. National instant criminal background check system
reporting.</B>
<C>1</C>
<TEXT>On and after March 20, 2013, the state court administrator
shall send electronically the following information to the
Colorado bureau of investigation created pursuant to section
24-33.5-401, referred to in this section as the "bureau":
</TEXT>
</TAB>
<TAB>
<C>b</C>
<TEXT>The name of each person who has been committed by order
of the court to the custody of the office of behavioral health
in the department of human services pursuant to section 27-81-112
or 27-82-108; and
</TEXT>
</TAB>
<TAB>
<C>3</C>
<TEXT>The state court administrator shall take all necessary steps
to cancel a record made by the state court administrator in the
national instant criminal background check system if:
</TEXT>
</TAB>
<TAB>
<C>b</C>
<TEXT>No less than three years before the date of the written
request:
</TEXT>
</TAB>
<TAB>
<C>II</C>
<TEXT>The period of commitment of the most recent order of
commitment or recommitment expired, or a court entered an order
terminating the person's incapacity or discharging the person
from commitment in the nature of habeas corpus, if the record in
the national instant criminal background check system is based on
an order of commitment to the custody of the office of behavioral
health in the department of human services; except that the state
court administrator shall not cancel any record pertaining to
a person with respect to whom two recommitment orders have been
entered pursuant to section 27-81-112 (7) and (8), or who was
discharged from treatment pursuant to section 27-81-112 (11) on
the grounds that further treatment is not likely to bring about
significant improvement in the person's condition; or
</TEXT>
</TAB>
</DOC>
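Explanation:
- The DTD declares DOC as the document element, along with the artificial TEXT and C elements.
- The main purpose is to impose a document structure as a sequence of TAB elements, each containing a section identifier (such as <b>SECTION 5.</b> or (c)) followed by the section body text.
- C elements capture text placed in parentheses (the ( and ) characters). The start- and end-element tags for C are inserted automatically by the SGML processor according to the DTD's SHORTREF map rules. These tell SGML that, within TAB child content, it should replace every ( character with the value of the startc entity (which expands to <C>) and every ) character with the value of the endc entity (which expands to </C>).
- <!USEMAP #EMPTY text> turns off the expansion of parentheses in the TEXT child content that forms the body of a TAB, so that references such as (7) and (8) in the body text are not changed (although these could also be turned into HTML-like links using SGML).

If you are using <tab> to represent the TAB (ASCII 9) character, SGML can handle that as well, e.g. by converting TAB characters into <TAB> tags with a SHORTREF rule similar to the ones shown.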
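Note that you need to have the osgmlnorm program installed. On Ubuntu you can install it with sudo apt-get install opensp, and installation is similar on other Linux distributions and on Mac OS. For your application, you may want to use the osx program (also part of OpenSP) to output the normalized parse result as XML (although the output shown above can already be parsed as XML), and then use the Java XML APIs to process the structured content as you need.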
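As a rough illustration of that last step, the sketch below reads XML shaped like the normalized output above with the standard Java DOM API and pulls out the label and body text of each TAB element. The class and method names are hypothetical, and it assumes the element names arrive in upper case (DOC, TAB, C, TEXT) as in the output shown. If your tool emits a different case, adjust the tag names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for reading the normalized document; not part of OpenSP.
class NormalizedDocReader {

    // Returns one "label|body" string per TAB element; label is empty when no C child exists.
    static List<String> readSections(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList tabs = doc.getElementsByTagName("TAB");
            List<String> sections = new ArrayList<>();
            for (int i = 0; i < tabs.getLength(); i++) {
                Element tab = (Element) tabs.item(i);
                sections.add(firstText(tab, "C") + "|" + firstText(tab, "TEXT"));
            }
            return sections;
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers need no throws clause.
            throw new IllegalStateException("Could not parse normalized document", e);
        }
    }

    // Text of the first descendant with the given tag, with runs of whitespace collapsed.
    static String firstText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        if (nodes.getLength() == 0) {
            return "";
        }
        return nodes.item(0).getTextContent().trim().replaceAll("\\s+", " ");
    }
}
```

From here, each label and body pair could be written to the database, with the nesting level decided by the label style.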