Remove all elements between two elements

Date: 2018-02-20 18:15:26

Tags: c# html xpath html-agility-pack

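I have about 2500 HTML files written to varying standards, and I need to remove the footer section from each of them. The HTML below is the footer from one of my files; I need to remove the two outer hr elements and everything between them.

So far I have only tried to locate the hr elements with XPath (via HTML Agility Pack), using DocumentNode.SelectNodes("//hr"), and then iterating over the result with foreach. But I am too much of a novice with XPath and do not know how to select a node together with its siblings (?) so that I can remove them.

This is what I have so far, with help from this community. :)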

// Requires: using System; using System.Collections.Generic; using System.IO; using HtmlAgilityPack;
private static void RemoveHR(IEnumerable<string> files)
{
    var errors = new List<string>();
    int removedCount = 0;
    foreach (var file in files)
    {
        try
        {
            // Use a fresh document per file so no state carries over between files.
            var document = new HtmlDocument();
            document.Load(file);
            // SelectNodes returns null (not an empty collection) when nothing matches.
            var hrs = document.DocumentNode.SelectNodes("//hr");
            if (hrs == null)
            {
                errors.Add(file + "|no hr element found");
                continue;
            }
            foreach (var hrNode in hrs)
            {
                hrNode.Remove();
                removedCount++;
            }
            document.Save(file);
        }
        catch (Exception ex)
        {
            errors.Add(file + "|" + ex.Message);
        }
    }
    using (StreamWriter logger = File.CreateText(@"D:\websites\dev.openjournal.tld\public\arkivet\ErrorLogs\hr_error_log.txt"))
    {
        foreach (var error in errors)
        {
            logger.WriteLine(error);
        }
    }
    Console.WriteLine("Number of hr elements removed: {0}", removedCount);
    Console.WriteLine("Number of files with errors or missing hr elements: {0}", errors.Count);
}

HTML source:

<hr color=#ff00ff SIZE=3> //start element
<p style="text-align : center; color : Red; font-weight : bold;">How to cite this paper:</i></p>
<p style="text-align : left; color : black;">Ekmek&ccedil;ioglu, F. &Ccedil;una, Lynch, Michael F. &amp; Willett, Peter   (1996)&nbsp; &quot;Stemming and N-gram matching for term conflation in Turkish texts&quot;&nbsp;<em>Information Research</em>, <strong>1</strong>(1) Available at: http://informationr.net/ir/2-2/paper13.html</p>
<p style="text-align : center">&copy; the authors, 1996.</p>
<hr color="#ff00ff" size="1"><div align="center">Check for citations, <a href="http://scholar.google.co.uk/scholar?hl=en&amp;q=http://informationr.net/ir/2-2/paper13.html&amp;btnG=Search&amp;as_sdt=2000">using Google Scholar</a></div>
                                 <hr color="#ff00ff" size="1">
<table border="0" cellpadding="15" cellspacing="0" align="center">
<tr> 
    <td><a href="infres22.html"><h4>Contents</h4></a></td>
    <td align="center" valign="top"><h5 align="center"><IMG SRC="http://counter.digits.net/wc/-d/-z/6/-b/FF0033/paper13" ALIGN=middle  WIDTH=60 HEIGHT=20 BORDER=0 HSPACE=4 VSPACE=2><br><a href="http://www.digits.net/ ">Web Counter</a><br>Counting only since 13 December 2002</h5></td>
    <td><a href="http://InformationR.net/ir/"><h4>Home</h4></a></td>
</tr>
</table>
<hr color=#ff00ff SIZE=3> //end element

EDIT: I tried using the preceding-sibling and following-sibling axes to locate the nodes. Unfortunately, the result does not include the boundary nodes themselves.

var footerTags = document.DocumentNode.SelectNodes("//*[preceding-sibling::p[contains(text(),'How to cite this')] and following-sibling::hr[@color = '#ff00ff']]");

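It selects every node that comes after the paragraph containing the text "How to cite this" and before an hr whose color is "#ff00ff". However, the two boundary nodes are not part of the selection, and they need to be removed together with the selected nodes.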

2 answers:

Answer 0 (score: 1)

I think this is what you are looking for:

Code

string content = System.IO.File.ReadAllText(@"D:\New Text Document.txt");
string html = Regex.Replace(content, "<hr.*?>", "", RegexOptions.Singleline);

Result

//start element
<p style="text-align : center; color : Red; font-weight : bold;">How to cite this paper:</i></p>
<p style="text-align : left; color : black;">Ekmek&ccedil;ioglu, F. &Ccedil;una, Lynch, Michael F. &amp; Willett, Peter   (1996)&nbsp; &quot;Stemming and N-gram matching for term conflation in Turkish texts&quot;&nbsp;<em>Information Research</em>, <strong>1</strong>(1) Available at: http://informationr.net/ir/2-2/paper13.html</p>
<p style="text-align : center">&copy; the authors, 1996.</p>
<div align="center">Check for citations, <a href="http://scholar.google.co.uk/scholar?hl=en&amp;q=http://informationr.net/ir/2-2/paper13.html&amp;btnG=Search&amp;as_sdt=2000">using Google Scholar</a></div>

<table border="0" cellpadding="15" cellspacing="0" align="center">
<tr> 
    <td><a href="infres22.html"><h4>Contents</h4></a></td>
    <td align="center" valign="top"><h5 align="center"><IMG SRC="http://counter.digits.net/wc/-d/-z/6/-b/FF0033/paper13" ALIGN=middle  WIDTH=60 HEIGHT=20 BORDER=0 HSPACE=4 VSPACE=2><br><a href="http://www.digits.net/ ">Web Counter</a><br>Counting only since 13 December 2002</h5></td>
    <td><a href="http://InformationR.net/ir/"><h4>Home</h4></a></td>
</tr>
</table>
 //end element
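
Note that the pattern above strips only the hr tags and leaves the content between them in place. If the goal is to drop the whole footer, the same regex approach could be stretched to match from the first size=3 hr up to and including the next one. This is only a rough sketch: the class and method names (FooterStripper, StripFooter) are made up for illustration, and it assumes that the footer is always delimited by hr tags with size 3 and that no other size=3 hr appears earlier in the file.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FooterStripper
{
    // Hypothetical pattern: an <hr> tag with size 3 (quotes optional, any letter case),
    // then the shortest possible run of anything, then the next <hr> tag with size 3.
    private static readonly Regex FooterPattern = new Regex(
        @"<hr\b[^>]*\bsize\s*=\s*[""']?3\b[""']?[^>]*>.*?<hr\b[^>]*\bsize\s*=\s*[""']?3\b[""']?[^>]*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Removes the first matching footer block; the input is returned unchanged if none is found.
    public static string StripFooter(string html)
    {
        return FooterPattern.Replace(html, string.Empty, 1);
    }
}
```

It would be called as string html = FooterStripper.StripFooter(content);. Regular expressions are fragile on irregular HTML, so across 2500 files of varying quality it is worth checking the output on a sample before overwriting anything.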

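Answer 1 (score: 1)

Assuming the start and end nodes really are identical (same tag name, attributes, and attribute values), as you mentioned in your comment above, it is not too hard:

  1. Select the start node.
  2. Iterate over the following siblings and remove each one, up to and including the end node.
  3. Remove the start node.

Sample HTML: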

    var html =
    @"<!doctype html system 'html.dtd'>
    <html><head></head>
    <body>
    
    <div>DO NOT DELETE</div>
    
    <hr color=""#ff00ff"" SIZE='3'> //start element
    <p style='text-align : center; color : Red; font-weight : bold;'>How to cite this paper:</i></p>
    <p style='text-align : left; color : black;'>Ekmek&ccedil;ioglu, F. &Ccedil;una, Lynch, Michael F. &amp; Willett, Peter   (1996)&nbsp; &quot;Stemming and N-gram matching for term conflation in Turkish texts&quot;&nbsp;<em>Information Research</em>, <strong>1</strong>(1) Available at: http://informationr.net/ir/2-2/paper13.html</p>
    <p style='text-align : center'>&copy; the authors, 1996.</p>
    <hr color='#ff00ff' size='1'><div align='center'>Check for citations, <a href='http://scholar.google.co.uk/scholar?hl=en&amp;q=http://informationr.net/ir/2-2/paper13.html&amp;btnG=Search&amp;as_sdt=2000'>using Google Scholar</a></div>
                                     <hr color='#ff00ff' size='1'>
    <table border='0' cellpadding='15' cellspacing='0' align='center'>
    <tr> 
        <td><a href='infres22.html'><h4>Contents</h4></a></td>
        <td align='center' valign='top'><h5 align='center'><IMG SRC='http://counter.digits.net/wc/-d/-z/6/-b/FF0033/paper13' ALIGN=middle  WIDTH=60 HEIGHT=20 BORDER=0 HSPACE=4 VSPACE=2><br><a href='http://www.digits.net/'>Web Counter</a><br>Counting only since 13 December 2002</h5></td>
        <td><a href='http://InformationR.net/ir/'><h4>Home</h4></a></td>
    </tr>
    </table>
    <hr COLOR='#ff00ff' SIZE=""3""> //end element
    
    <div>DO NOT DELETE</div>
    </body>
    </html>";
    

Parse it:

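    var document = new HtmlDocument();
    document.LoadHtml(html);
    var startNode = document.DocumentNode.SelectSingleNode("//hr[@size='3'][@color='#ff00ff']");
    // SelectSingleNode returns null when nothing matches, so guard before using it
    if (startNode != null)
    {
        // account for mismatched quotes in HTML source
        var quotesRegex = new Regex("[\"']");
        var startNodeNoQuotes = quotesRegex.Replace(startNode.OuterHtml, "");
        HtmlNode siblingNode;

        // after each Remove(), startNode.NextSibling points at the next remaining sibling
        while ((siblingNode = startNode.NextSibling) != null)
        {
            siblingNode.Remove();
            // ignore letter case as well, since the sample markup mixes color and COLOR
            if (string.Equals(quotesRegex.Replace(siblingNode.OuterHtml, ""), startNodeNoQuotes, StringComparison.OrdinalIgnoreCase))
            {
                break;  // end node
            }
        }

        startNode.Remove();
    }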
    

Resulting output:

    <!doctype html system 'html.dtd'>
    <html><head></head>
    <body>
    
    <div>DO NOT DELETE</div>
    
     //end element
    
    <div>DO NOT DELETE</div>
    </body>
    </html>
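
For running the three steps over many files, they could be wrapped in a helper along these lines. This is a sketch rather than tested production code: HrFooterRemover, IsBoundary and RemoveFooter are hypothetical names, the default size and color values are taken from the sample HTML, and it compares attribute values instead of OuterHtml so that quoting and letter case in the source do not matter. It also looks for the end node before deleting anything, so a file with only one matching hr is left untouched.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class HrFooterRemover
{
    // Hypothetical helper: true for an <hr> element whose size and color
    // attribute values equal the expected ones (compared case-insensitively).
    private static bool IsBoundary(HtmlNode node, string size, string color)
    {
        return node.NodeType == HtmlNodeType.Element
            && string.Equals(node.Name, "hr", StringComparison.OrdinalIgnoreCase)
            && string.Equals(node.GetAttributeValue("size", ""), size, StringComparison.OrdinalIgnoreCase)
            && string.Equals(node.GetAttributeValue("color", ""), color, StringComparison.OrdinalIgnoreCase);
    }

    // Removes the first boundary <hr>, every sibling after it, and the closing boundary <hr>.
    // Returns false and changes nothing when either boundary is missing.
    public static bool RemoveFooter(HtmlDocument document, string size = "3", string color = "#ff00ff")
    {
        HtmlNode start = document.DocumentNode
            .Descendants("hr")
            .FirstOrDefault(n => IsBoundary(n, size, color));
        if (start == null) return false;

        // Locate the end node first so nothing is deleted if it does not exist.
        HtmlNode end = start.NextSibling;
        while (end != null && !IsBoundary(end, size, color)) end = end.NextSibling;
        if (end == null) return false;

        HtmlNode current = start.NextSibling;
        while (current != null)
        {
            HtmlNode next = current.NextSibling; // read before the node is detached
            bool isEnd = ReferenceEquals(current, end);
            current.Remove();
            if (isEnd) break;
            current = next;
        }

        start.Remove();
        return true;
    }
}
```

In the loop over files this would be used roughly as: load the file with document.Load(file), call HrFooterRemover.RemoveFooter(document), and call document.Save(file) only when it returns true, logging the file name otherwise.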