Question

I have an HTML table in a string. Most of the HTML came from FrontPage, so it is badly formatted. Here is a quick sample.

<b>Table 1</b>
  <table class='class1'>
  <tr>
    <td>
      <p>Procedure Name</td>
    <td>
        <p>Procedure</td>
    </tr>
  </table>
<p><b>Table 2</b></p>
  <table class='class2'>
    <tr>
      <td>
        <p>Procedure Name</td>
        <td>
        <p>Procedure</td>
    </tr>
  </table>
<p> Some text is here</p>

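As I understand it, FrontPage automatically adds a <p> to every new cell.

I want to remove the <p> tags that are inside the tables but keep the ones outside the tables. So far I have tried two approaches:

First approach

The first approach was to use a single RegEx to capture every <p> tag inside the tables and then remove them with Regex.Replace(). However, I never managed to get the RegEx right. (I know that parsing HTML with RegEx is a bad idea. I thought the data was simple enough to apply a RegEx to it.)

I can easily get everything inside each table with this regex: <table.*?>(.*?)</table>

Then I wanted to grab only the <p> tags, so I wrote this: (?<=<table.*?>)(<p>)(?=</table>). It doesn't match anything. (Apparently .NET allows quantifiers in its lookarounds. At least that was my impression from using http://regexhero.net/tester/.)

Is there any way I can modify this RegEx to capture only what I need?

Second approach

The second approach was to capture only the table contents into a string and then use String.Replace() to remove the <p> tags. I'm using the following code to capture the matches: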

MatchCollection tablematch = Regex.Matches(htmlSource, @"<table.*?>(.*?)</table>", RegexOptions.Singleline);

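htmlSource is a string containing the entire HTML page, and it is the variable that will be sent back to the client after processing. I only want to remove from htmlSource what actually needs removing.

How can I use the MatchCollection to remove the <p> tags and then write the updated tables back into htmlSource?

Thanks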

Answer 1

This answer is based on the second suggested approach. Change the regex so that it matches the whole table, tags included:

<table.*?table>

and call Regex.Replace with a MatchEvaluator that performs the desired replacement on each match:

Regex myRegex = new Regex(@"<table.*?table>", RegexOptions.Singleline);
string replaced = myRegex.Replace(htmlSource, m=> m.Value.Replace("<p>",""));
Console.WriteLine(replaced);

Output for the input from the question:

<b>Table 1</b>
    <table class='class1'>
    <tr>
    <td>
        Procedure Name</td>
    <td>
        Procedure</td>
    </tr>
    </table>
<p><b>Table 2</b></p>
    <table class='class2'>
    <tr>
        <td>
        Procedure Name</td>
        <td>
        Procedure</td>
    </tr>
    </table>
<p> Some text is here</p>
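
One caveat: String.Replace("<p>", "") removes only the exact lowercase text <p>. If the real FrontPage output also contains variants such as <P> or <p align="left"> (an assumption, since the sample in the question shows neither), a nested case-insensitive regex inside the evaluator may be more robust. A minimal sketch of that variation; the class and method names are made up for illustration:

```csharp
using System.Text.RegularExpressions;

public static class TableParagraphStripper
{
    // Matches a whole table, tags included; Singleline lets '.' cross newlines.
    private static readonly Regex TableRegex = new Regex(
        @"<table\b.*?</table\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Matches an opening <p> tag with optional attributes.
    // The \b keeps it from matching <pre>, <param> and similar tags.
    private static readonly Regex OpenParagraphRegex = new Regex(
        @"<p\b[^>]*>",
        RegexOptions.IgnoreCase);

    // Removes opening <p> tags inside tables and leaves the rest untouched.
    public static string Strip(string htmlSource)
    {
        return TableRegex.Replace(
            htmlSource,
            m => OpenParagraphRegex.Replace(m.Value, ""));
    }
}
```

Calling TableParagraphStripper.Strip(htmlSource) returns the cleaned page. Closing </p> tags are left alone here, because the sample contains none inside the tables.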

Answer 2

I guess this can be done with a delegate (callback).

string html = @"
<b>Table 1</b>
  <table class='class1'>
  <tr>
    <td>
      <p>Procedure Name</td>
    <td>
        <p>Procedure</td>
    </tr>
  </table>
<p><b>Table 2</b></p>
  <table class='class2'>
    <tr>
      <td>
        <p>Procedure Name</td>
        <td>
        <p>Procedure</td>
    </tr>
  </table>
<p> Some text is here</p>
";

Regex RxTable = new Regex( @"(?s)(<table[^>]*>)(.+?)(</table\s*>)" );
Regex RxP = new Regex( @"<p>" );

string htmlNew = RxTable.Replace( 
    html,
    delegate(Match match)
    {
       return match.Groups[1].Value + RxP.Replace(match.Groups[2].Value, "") + match.Groups[3].Value;
    }
);
Console.WriteLine( htmlNew );

Output:

<b>Table 1</b>
  <table class='class1'>
  <tr>
    <td>
      Procedure Name</td>
    <td>
        Procedure</td>
    </tr>
  </table>
<p><b>Table 2</b></p>
  <table class='class2'>
    <tr>
      <td>
        Procedure Name</td>
        <td>
        Procedure</td>
    </tr>
  </table>
<p> Some text is here</p>
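
For completeness, the single-regex idea from the question's first approach can probably be made to work too. The pattern in the question fails because (?=</table>) requires </table> to follow <p> immediately, and because . does not cross newlines without RegexOptions.Singleline. The sketch below relies on .NET's support for variable-length lookbehind and treats a <p> as "inside a table" when a <table ...> tag precedes it with no </table in between. It assumes tables are not nested, and the class name is hypothetical:

```csharp
using System.Text.RegularExpressions;

public static class LookbehindStripper
{
    // (?<= ... )          lookbehind: the text before <p> must contain
    // <table\b[^>]*>      an opening table tag, followed by
    // (?:(?!</table).)*   any characters that do not begin a closing table tag
    private static readonly Regex ParagraphInsideTable = new Regex(
        @"(?<=<table\b[^>]*>(?:(?!</table).)*)<p>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Deletes every <p> that sits between <table ...> and its </table>.
    public static string Strip(string htmlSource)
    {
        return ParagraphInsideTable.Replace(htmlSource, "");
    }
}
```

The lookbehind rescans the preceding text for every <p> candidate, so on large pages the callback version above is likely both simpler and faster.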

Answer 3

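In general, regular expressions (in .NET, via balancing groups) let you work with nested structures. It is very ugly and you should avoid it, but if you have no other option, you can use it.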

static void Main()
{
    string s = 
@"A()
{
    for()
    {
    }
    do
    {
    }
}
B()
{
    for()
    {
    }   
}
C()
{
    for()
    {
        for()
        {
        }
    }   
}";

    var r = new Regex(@"  
                      {                       
                          (                 
                              [^{}]           # everything except braces { }   
                              |
                              (?<open>  { )   # if { then push
                              |
                              (?<-open> } )   # if } then pop
                          )+
                          (?(open)(?!))       # true if stack is empty
                      }                                                                  

                    ", RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);

    int counter = 0;

    foreach (Match m in r.Matches(s))
        Console.WriteLine("Outer block #{0}\r\n{1}", ++counter, m.Value);

    Console.Read();
}

Here the regex "knows" where a block starts and where it ends, so you can use this information if the <p> tags have no proper closing tags.
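
Applied to the HTML from the question, the same balancing-group idea could cope with tables nested inside table cells, which a lazy <table.*?</table> would cut short at the first inner </table>. The sample in the question has no nested tables, so this is a sketch under that extra assumption, with a hypothetical class name:

```csharp
using System.Text.RegularExpressions;

public static class NestedTableStripper
{
    private static readonly Regex OutermostTable = new Regex(@"
        <table\b[^>]*>                   # opening tag of the outermost table
        (?:
            (?<open> <table\b[^>]*> )    # nested opening tag: push
          |
            (?<-open> </table\s*> )      # nested closing tag: pop
          |
            (?! </?table\b ) .           # any character that does not start a table tag
        )*
        (?(open)(?!))                    # fail if a nested table is still open
        </table\s*>                      # closing tag of the outermost table
        ",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline |
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    // Removes <p> from each outermost table, including any tables nested in it.
    public static string Strip(string htmlSource)
    {
        return OutermostTable.Replace(htmlSource, m => m.Value.Replace("<p>", ""));
    }
}
```

Because each match spans a complete outermost table, a <p> that follows a nested table but still sits inside the outer one is removed as well.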

Removing part of a Regex.Match string
