Question

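I am trying to delete all HTML elements (tags) that contain a specific text string. I have 2376 HTML documents, all with different doctype standards. Some do not even have a doctype (probably not relevant to this problem).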

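So, I am looking for the text string "How to cite this paper", and I have found that it is wrapped in either a <p>-tag, an <h4>-tag or a <legend>-tag.

The <p>-tag usually looks like this,

<p style="text-align : center; color : Red; font-weight : bold;">How to cite this paper:</i></p>

The <h4>-tag usually looks like this,

<h4>How to cite this paper:</h4>Antunes, P., Costa, C.J. & Pino, J.A. (2006).

The <legend>-tag looks like this,

<legend style="color: white; background-color: maroon; font-size: medium; padding: .1ex .5ex; border-right: 1px solid navy; border-bottom: 1px solid navy; font-weight: bold;">How to cite this paper</legend>

The task at hand is to find these tags and remove them from the files, then save the files again. I do need to remove more tags, but I need some help understanding HAP (Html Agility Pack) and XPath, and how to target a specific tag based on its value or other unique data.

So far I have come up with this code in C#; it is a console application. This is my main (sorry for the bad indentation); GetHTMLFiles is the private method that looks through the file directory,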

//Variables
string Ext = "*.html";
string folder = @"D:\websites\dev.openjournal.tld\public\arkivet\";
IEnumerable<string> files = GetHTMLFiles(folder, Ext);
List<string> cite_files = new List<string>();            
var doc = new HtmlDocument();

//Loop to match all html-elements to query
foreach (var file in files)
{
 try
   {
      doc.Load(file);
      cite_files.Add(doc.DocumentNode.SelectNodes("//h4[contains(., 'How to cite this paper')]").ToString()); 

     cite_files.Add(doc.DocumentNode.SelectNodes("//p[contains(., 'How to cite this paper')]").ToString());
   }                
                    catch (Exception Ex)
                    {
                        Console.WriteLine(Ex.Message);
                    }
                }

                //Counts numbers of hits and prints data to user
                int filecount = files.Count();
                int citations = cite_files.Count();            
                Console.WriteLine("Number of files scanned: " + filecount);
                Console.WriteLine("Number of citations: {0}", citations);

                // Program end
                Console.WriteLine("Press any key to close program....");
                Console.ReadKey();

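The unique thing seems to be "How to cite this paper", so I am trying to find all the specific tags that contain those exact words and then remove them. My Notepad shows that there should be 1094 files with this phrase, so I am trying to get them all. :)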

Any help is much appreciated! :)

Answer 1

Html Agility Pack supports LINQ selectors, which come in very handy in this case. Given some HTML based on your examples above:

var html =
@"<html><head></head><body>

<!-- selector match: delete these nodes -->
<p style='text-align: center; color: Red; font-weight: bold;'>How to cite this paper:</i></p>
<h4> How to cite this paper:</h4> Antunes, P., Costa, C.J. &amp; Pino, J.A. (2006).
<legend style='color: white; background-color: maroon; font-size: medium; padding: .1ex .5ex; border-right: 1px solid navy; border-bottom: 1px solid navy; font-weight: bold;'>How to cite this paper </legend>
<div><p><i><b>How to cite this paper (NESTED)</b></i></p></div>

<!-- no match: keep these nodes -->
<p>DO NOT DELETE - How to cite</p>
<h4>DO NOT DELETE - cite this paper:</h4>
<legend>DO NOT DELETE</legend>

</body></html>";

You can create a collection of the tags that should be searched, select the matching nodes, and then remove them, like this:

var tagsToDelete = new string[] { "p", "h4", "legend" };
var nodesToDelete = new List<HtmlNode>();

var document = new HtmlDocument();
document.LoadHtml(html);
foreach (var tag in tagsToDelete)
{
    nodesToDelete.AddRange(
        from searchText in document.DocumentNode.Descendants(tag)
            where searchText.InnerText.Contains("How to cite this paper")
            select searchText
    );
}

foreach (var node in nodesToDelete) node.Remove();

document.Save(OUTPUT); // OUTPUT: placeholder for the output file path

With the following result:

<html><head></head><body>

<!-- selector match: delete these nodes -->

 Antunes, P., Costa, C.J. &amp; Pino, J.A. (2006).

<div></div>

<!-- no match: keep these nodes -->
<p>DO NOT DELETE - How to cite</p>
<h4>DO NOT DELETE - cite this paper:</h4>
<legend>DO NOT DELETE</legend>

</body></html>
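
Since the question also asks about XPath, here is a rough sketch of the same removal done with SelectNodes instead of LINQ, plus a loop over a folder of files. Treat it as an illustration under assumptions rather than a tested solution: CitationCleaner, RemoveCitationNodes and CleanFolder are names made up for this example and are not part of Html Agility Pack, and the sketch assumes the default behaviour where SelectNodes returns null when nothing matches (which is probably why the loop in the question ends up in the catch block for files without a hit).

```csharp
using System.IO;
using System.Linq;
using HtmlAgilityPack;

public static class CitationCleaner
{
    // One XPath union that matches any <p>, <h4> or <legend>
    // whose text contains the phrase.
    private const string Query =
        "//p[contains(., 'How to cite this paper')]" +
        " | //h4[contains(., 'How to cite this paper')]" +
        " | //legend[contains(., 'How to cite this paper')]";

    // Removes the matching nodes and returns how many were removed.
    public static int RemoveCitationNodes(HtmlDocument document)
    {
        // Assumption: SelectNodes returns null, not an empty collection,
        // when there is no match, so guard against it.
        var matches = document.DocumentNode.SelectNodes(Query);
        if (matches == null) return 0;

        // Copy to a list first so the tree is not modified
        // while the collection is being enumerated.
        var nodes = matches.ToList();
        foreach (var node in nodes) node.Remove();
        return nodes.Count;
    }

    // Hypothetical helper: cleans every *.html file in a folder in place.
    public static void CleanFolder(string folder)
    {
        // GetFiles takes a snapshot of the file list before any file is rewritten.
        foreach (var file in Directory.GetFiles(folder, "*.html"))
        {
            var document = new HtmlDocument();
            document.Load(file);

            // Only rewrite files that actually changed.
            if (RemoveCitationNodes(document) > 0) document.Save(file);
        }
    }
}
```

Calling CitationCleaner.CleanFolder(folder) with the folder from the question would overwrite the files in place, so it is worth trying on a copy of the directory first, especially since the documents have mixed doctypes and possibly mixed encodings.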

Find all elements that contain a text string?
