Match exact closest string with regex

Time: 2016-04-25 09:01:41

Tags: c# regex

I have a string:

Test.
<div>
<table style="color:blue;"><tbody><!--START SPACE COMMENTS SUMMARY-->
<tr><td colspan="2">SPACE COMMENTS SUMMARY</td></tr>
<tr><td style="min-width:200px;">Area/Room</td>
<td style="max-width:300px;text-align:left;">Comments</td>
</tr><tr><td style="min-width:200px;">Bathroom</td>
<td style="max-width:300px;text-align:left;">Some comment</td></tr>
<!--END SPACE COMMENTS SUMMARY--></tbody></table>
<div>
<table style="color:blue;"><tbody><!--START SPACE SUMMARY-->
<tr><td colspan="2">SPACE SUMMARY</td></tr><tr>
<td style="min-width:200px;">Space</td>
<td style="max-width:300px;text-align:right;">Installed Price</td></tr>
<tr><td style="min-width:200px;">Bathroom</td>
<td style="max-width:300px;text-align:right;">$2,355.97</td></tr>
<!--END SPACE SUMMARY--></tbody></table>
<br><br><br><div>Some text.</div></div></div>

I want to use a regex to select only the table that contains the comments <!--START SPACE SUMMARY--> and <!--END SPACE SUMMARY-->.

I tried @"<table.*?><tbody.*?><!--START SPACE SUMMARY-->.*?<!--END SPACE SUMMARY--></tbody></table>", but it selects both tables in the string.

EDIT: My question is not specifically about HTML. The same question would stand if I had the string:

some text blah blah one some text blah blah two.

And I want to select only "some text blah blah two" with a pattern like some text.*?two.
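The behavior described above can be reproduced with a small sketch. The class and method names (LazyDemo, FirstLazyMatch) are made up for illustration. It shows that a lazy quantifier only keeps the match short on the right side: the engine still starts at the leftmost position where a match is possible, so the first "some text" wins.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original question.
public static class LazyDemo
{
    // Returns the first match of the lazy pattern from the question.
    public static string FirstLazyMatch(string input)
    {
        // .*? stops at the first "two" it can reach, but the match
        // still begins at the leftmost "some text" in the input.
        return Regex.Match(input, @"some text.*?two").Value;
    }

    public static void Main()
    {
        string input = "some text blah blah one some text blah blah two.";
        // prints: some text blah blah one some text blah blah two
        Console.WriteLine(FirstLazyMatch(input));
    }
}
```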

3 Answers:

Answer 0 (score: 1)

string test = @"Test.
    <div>
    <table style=""color:blue;""><tbody><!--START SPACE COMMENTS SUMMARY-->
    <tr><td colspan=""2"">SPACE COMMENTS SUMMARY</td></tr>
    <tr><td style=""min-width:200px;"">Area/Room</td>
    <td style=""max-width:300px;text-align:left;"">Comments</td>
    </tr><tr><td style=""min-width:200px;"">Bathroom</td>
    <td style=""max-width:300px;text-align:left;"">Some comment</td></tr>
    <!--END SPACE COMMENTS SUMMARY--></tbody></table>
    <div>
    <table style=""color:blue;""><tbody><!--START SPACE SUMMARY-->
    <tr><td colspan=""2"">SPACE SUMMARY</td></tr><tr>
    <td style=""min-width:200px;"">Space</td>
    <td style=""max-width:300px;text-align:right;"">Installed Price</td></tr>
    <tr><td style=""min-width:200px;"">Bathroom</td>
    <td style=""max-width:300px;text-align:right;"">$2,355.97</td></tr>
    <!--END SPACE SUMMARY--></tbody></table>
    <br><br><br><div>Some text.</div></div></div>";

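// Requires: using System.Text.RegularExpressions;
// RegexOptions.Singleline makes . match newlines, so the pattern can span several lines.
MatchCollection matches = Regex.Matches(test, @"<table(?!.*<table).*?<!--START SPACE SUMMARY-->.*?<!--END SPACE SUMMARY-->.*?table>", RegexOptions.Singleline);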

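The idea is to use (?!.*<table) to tell the regex engine that no other <table may follow this one. Note that with RegexOptions.Singleline the lookahead scans to the end of the input, so the match can only start at the last <table in the string; this works here because the wanted table is the last one.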
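A minimal sketch of how this lookahead behaves, using shortened stand-in HTML rather than the full string from the question. The class and method names (LastTableDemo, FindSummaryTable) are assumptions made for the example. It shows the pattern finding the wanted table when it is the last one, and finding nothing once another table follows it.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class built around the pattern from this answer.
public static class LastTableDemo
{
    private const string Pattern =
        @"<table(?!.*<table).*?<!--START SPACE SUMMARY-->.*?<!--END SPACE SUMMARY-->.*?table>";

    // Returns the matched table, or an empty string when nothing matches.
    public static string FindSummaryTable(string html)
    {
        return Regex.Match(html, Pattern, RegexOptions.Singleline).Value;
    }

    public static void Main()
    {
        string comments = "<table><!--START SPACE COMMENTS SUMMARY-->a<!--END SPACE COMMENTS SUMMARY--></table>";
        string summary = "<table><!--START SPACE SUMMARY-->b<!--END SPACE SUMMARY--></table>";

        // Wanted table is last: only that table is returned.
        Console.WriteLine(FindSummaryTable(comments + "\n" + summary));

        // Another table follows: the lookahead rejects every <table except
        // the final one, which has no summary comments, so nothing matches.
        Console.WriteLine(FindSummaryTable(comments + summary + "<table>c</table>") == "");
    }
}
```

If the wanted table may appear anywhere in the input, a lookahead that is checked character by character (a tempered greedy token) is likely a better fit than one that scans to the end of the string.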

Answer 1 (score: 1)

Let's focus on the non-HTML problem you have: matching the closest window between two delimiters. Use a tempered greedy token:

(?s)some text(?:(?!some text|two).)*two
    |<-1st->||<----TG Token ------>||
                                    |2nd delimiter

See the regex demo.

For HTML parsing, use HtmlAgilityPack; it will make life easier for everyone who maintains the code.

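(?s) enables DOTALL mode (RegexOptions.Singleline in .NET), in which . matches any character including a newline. The tempered greedy token (?:(?!some text|two).)* matches any character that is not the start of a some text or two literal character sequence.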
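A minimal C# sketch of the tempered greedy token described above. The class and method names are hypothetical, and the HTML variant of the pattern is an adaptation made for this example, not something taken from the answer. The HTML input is a shortened stand-in for the string in the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class for the tempered greedy token.
public static class TemperedDemo
{
    // Pattern from this answer: the token refuses to step over a new
    // "some text" (or "two"), so the window cannot span two openers.
    public static string ClosestWindow(string input)
    {
        return Regex.Match(input, @"(?s)some text(?:(?!some text|two).)*two").Value;
    }

    // Assumed adaptation to the HTML case: between <table and the START
    // comment, no character may begin another <table.
    public static string SummaryTable(string html)
    {
        const string pattern =
            @"<table(?:(?!<table).)*?<!--START SPACE SUMMARY-->.*?<!--END SPACE SUMMARY-->.*?</table>";
        return Regex.Match(html, pattern, RegexOptions.Singleline).Value;
    }

    public static void Main()
    {
        // prints: some text blah blah two
        Console.WriteLine(ClosestWindow("some text blah blah one some text blah blah two."));

        string html =
            "<table><!--START SPACE COMMENTS SUMMARY-->a<!--END SPACE COMMENTS SUMMARY--></table>" +
            "<table><!--START SPACE SUMMARY-->b<!--END SPACE SUMMARY--></table>" +
            "<table>c</table>";
        // prints: <table><!--START SPACE SUMMARY-->b<!--END SPACE SUMMARY--></table>
        Console.WriteLine(SummaryTable(html));
    }
}
```

Unlike a lookahead that scans to the end of the input, the token is checked at every position, so the wanted table does not have to be the last one.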

Answer 2 (score: 0)

Try this:

<table.*?><tbody.*?><!--START (SPACE SUMMARY)-->.*?<!--END \1--><\/tbody><\/table>

It should work with non-greedy quantifiers, but here I use the backreference \1 to repeat the value of group 1, and I also escape / as \/. Maybe that was the source of the problem.