HtmlAgilityPack - Removing ads between HTML comment tags

Time: 2014-08-31 21:09:14

Tags: c# xpath html-agility-pack

I need to remove the part between <!-- custom ads --> and <!-- /custom ads --> in this snippet.

<!-- custom ads -->
<div style="float:left">
  <!-- custom_Forum_Postbit_336x280 -->
  <div id='div-gpt-ad-1526374586789-2' style='width:336px; height:280px;'>
    <script type='text/javascript'>
       googletag.display('div-gpt-ad-1526374586789-2');
    </script>
  </div>
</div>
<div style="float:left; padding-left:20px">
  <!-- custom_Forum_Postbit_336x280_r -->
  <div id='div-gpt-ad-1526374586789-3' style='width:336px; height:280px;'>
    <script type='text/javascript'>
      googletag.display('div-gpt-ad-1526374586789-3');
    </script>
   </div>
</div>
<div class="clear"></div>

 <br>
<!-- /custom ads -->


<!-- google_ad_section_start -->Some Text,<br>
Some More Text...<br>
<!-- google_ad_section_end -->

I can already find the two comments with the XPath //comment()[contains(., 'custom')], but I am stuck on how to remove everything that sits between these "markers".

        foreach (var comment in htmlDoc.DocumentNode.SelectNodes("//comment()[contains(., 'custom')]"))
        {
            MessageBox.Show(comment.OuterHtml);
        }

Any suggestions?

1 answer:

Answer 0: (score: 3)

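using System;
using System.Linq;
using HtmlAgilityPack;

// doc is the already-loaded HtmlDocument (htmlDoc in the question).
// Find all comment nodes that contain "custom ads"; this matches both the
// opening <!-- custom ads --> and the closing <!-- /custom ads --> comment.
var nodes = doc.DocumentNode
               .Descendants()
               .OfType<HtmlCommentNode>()
               .Where(c => c.Comment.Contains("custom ads"))
               .ToList();
// Pair the comments up in document order: (0, 1), (2, 3), ...
// This assumes every opening comment has a closing one; with an odd
// number of matches, a[1] throws an IndexOutOfRangeException.
var nodePairs = nodes
    .Select((node, index) => new {node, index})
    .GroupBy(x => x.index / 2)
    .Select(g => g.ToArray())
    .Select(a => new { startComment = a[0].node, endComment = a[1].node});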

foreach (var pair in nodePairs)
{
    var startNode = pair.startComment;
    var endNode = pair.endComment;
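    // The two markers must be siblings; otherwise walking NextSibling from
    // the start comment would never reach the end comment.
    if (startNode.ParentNode != endNode.ParentNode)
        throw new InvalidOperationException("Start and end comments do not share a parent node.");
    // Iterate over all nodes in between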
    var currentNode = startNode.NextSibling;
    while(currentNode != endNode)
    {
        // currentNode loses its siblings once it is removed from the doc,
        // so grab NextSibling while it is still attached
        var n = currentNode.NextSibling;
        // then cut currentNode out of the document
        currentNode.Remove();
        currentNode = n;
    }
}
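
As a rough, self-contained sketch of the same idea, the loop above can be wrapped in a helper and run against a shortened version of the snippet from the question. The names AdStripper, IsMarker and RemoveBetweenComments are hypothetical and not part of HtmlAgilityPack. It also swaps in a different matching strategy: instead of pairing comments by index, it searches the following siblings of each opening comment for the closing one, and it skips an unmatched opening comment rather than throwing. It assumes the HtmlAgilityPack NuGet package is referenced and a compiler that supports C# 7 or later.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class AdStripper
{
    // Hypothetical helper: true if the node is a comment whose text contains the marker.
    private static bool IsMarker(HtmlNode node, string marker)
    {
        return node is HtmlCommentNode comment && comment.Comment.Contains(marker);
    }

    // Hypothetical helper: removes every node between a comment containing
    // startMarker and the next sibling comment containing endMarker.
    // The two comments themselves are left in place.
    public static void RemoveBetweenComments(HtmlDocument doc, string startMarker, string endMarker)
    {
        // Opening comments contain startMarker but not endMarker
        // ("/custom ads" also contains "custom ads", so exclude it explicitly).
        var starts = doc.DocumentNode
                        .Descendants()
                        .Where(n => IsMarker(n, startMarker) && !IsMarker(n, endMarker))
                        .ToList();

        foreach (var start in starts)
        {
            // Look for the closing comment among the following siblings.
            var end = start.NextSibling;
            while (end != null && !IsMarker(end, endMarker))
                end = end.NextSibling;

            if (end == null)
                continue; // no closing comment at this level: leave the markup alone

            var current = start.NextSibling;
            while (current != end)
            {
                // Remember the next sibling before detaching the current node.
                var next = current.NextSibling;
                current.Remove();
                current = next;
            }
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var html =
            "<!-- custom ads -->" +
            "<div style=\"float:left\"><div id='div-gpt-ad-1526374586789-2'></div></div>" +
            "<br>" +
            "<!-- /custom ads -->" +
            "<!-- google_ad_section_start -->Some Text,<br>" +
            "<!-- google_ad_section_end -->";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        AdStripper.RemoveBetweenComments(doc, "custom ads", "/custom ads");

        // The ad divs should be gone; the marker comments and the text remain.
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

Searching siblings for the closing comment avoids the index-pairing assumption, at the cost of silently ignoring markers that are not siblings. If that situation should be treated as an error, throwing as in the answer above is the safer choice.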