I am working in .NET with XML data from the Federal Register, which contains many references to Executive Orders and chapters of the U.S. Code.
I'd like to be able to hyperlink these references, unless they're already inside an <a> tag (which is determined by the XML, and often links within the document itself).
The pattern I've written also matches one leading and one trailing character and drops them from the output, even if I include the boundary character in the replacement string:
[?!<a href="#(.*)">]([0-9]{1,2})[ ]{0,1}(U\.S\.C\.|USC)[\s]{0,1}([0-9]{1,5})(\b)[^</a>]
An example of the initial XML:
<p>The Regulatory Flexibility Act of 1980 (RFA), 5 U.S.C. 604(b), as amended, requires Federal agencies to consider the potential impact of regulations on small entities during rulemaking.</p>
<p>Small entities include small businesses, small not-for-profit organizations, and small governmental jurisdictions.</p>
<p>Section 605 of the RFA allows an agency to certify a rule, in lieu of preparing an analysis, if the rulemaking is not expected to have a significant economic impact on a substantial number of small entities. Reference: <a href="#1">13 USC 401</a></p>
<ul>
<li><em>Related laws from 14USC301-345 do not apply.</em></li>
<li><a href="#2">14 USC 301</a> does apply.</li>
</ul>
As you can see, some references include ranges of U.S. Code sections (e.g. 14 USC 301-345) or references to specific subsections (e.g. 5 U.S.C. 604(b)). I'd only want to link to the first reference in the range, so the link should terminate at the - or the (.
Answer 0 (score: 0)
If I understand you correctly, I think the following should work.
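using System.Text.RegularExpressions;

// Match a citation such as "5 U.S.C. 604" or "14USC301",
// unless it is immediately followed by a closing </a> tag.
var re = new Regex(@"\d{1,2}\s?U\.?S\.?C\.?\s?\d{1,5}\b(?!</a>)");
var matches = re.Matches(text);
// matches[0].Value = 5 U.S.C. 604
// matches[1].Value = 14USC301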
You could even simplify the regex to \d+\s?U\.?S\.?C\.?\s?\d+\b(?!</a>); I'm not sure whether the upper limits of 2 and 5 are significant.
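On the symptom in the question: square brackets define a character class, so [?!<a href="#(.*)">] and [^</a>] each consume exactly one character instead of acting as lookarounds, which would explain the missing leading and trailing characters. The lookahead above also only protects a citation that sits directly before </a>. As a rough sketch of the replacement step itself, one option is an alternation whose first branch consumes any existing <a>...</a> element unchanged, so that only bare citations reach the second branch. The URL template and the names UrlTemplate and Linkify are placeholders made up for illustration, not a real U.S. Code link scheme.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical link target: {0} = title, {1} = section. Replace with a real URL scheme.
const string UrlTemplate = "https://example.com/uscode/{0}/{1}";

// Branch 1 swallows an existing <a ...>...</a> element so it is left untouched.
// Branch 2 captures a bare citation, stopping before "-" or "(" via \b.
var re = new Regex(
    @"<a\b[^>]*>.*?</a>|\b(?<title>\d{1,2})\s?(?:U\.S\.C\.|USC)\s?(?<section>\d{1,5})\b",
    RegexOptions.Singleline);

string Linkify(string xml) => re.Replace(xml, m =>
{
    // Branch 1 matched: no citation groups were captured, so return the anchor as-is.
    if (!m.Groups["title"].Success) return m.Value;

    var href = string.Format(UrlTemplate, m.Groups["title"].Value, m.Groups["section"].Value);
    return $"<a href=\"{href}\">{m.Value}</a>";
});

var sample = "See 5 U.S.C. 604(b) and <a href=\"#1\">13 USC 401</a>.";
Console.WriteLine(Linkify(sample));
// See <a href="https://example.com/uscode/5/604">5 U.S.C. 604</a>(b) and <a href="#1">13 USC 401</a>.
```

This assumes anchors are never nested and that the markup is regular enough for a regex. If that does not hold, loading the document with an XML parser such as XDocument and rewriting only text nodes that have no <a> ancestor would probably be more robust.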