Extracting author metadata with XPath

Time: 2019-05-27 15:47:59

Tags: c# xpath web-scraping

I have a series of HTML files in which each author's name is split into a first name and a last name in meta tags.

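My current HTML structure is shown below. I want to extract each author's first and last name correctly so that I can use this data to index these HTML files. The number of authors can vary from one HTML document to another.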

<!doctype html>  
 <html lang="en">  
 <head>
 <title>Title of document</title>
 <meta charset="utf-8" />  
 <meta name="dcterms.title" content="The science papers title" />  

<meta name="author" />
    <meta name="firstname" content="Eddard" />
    <meta name="lastname" content="Stark" />

<meta name="author" />
    <meta name="firstname" content="Tywin" />
    <meta name="lastname" content="Lannister" />

<meta name="author" />
    <meta name="firstname" content="Jon" />
    <meta name="lastname" content="Snow" />

 <meta name="dcterms.subject" content="The articles subject" />  
 <meta name="description" content="The articles description, abstract or introduction" />  
 <meta name="keywords" content="keyword1, keyword2, keyword3" />
</head>  

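I have been using C# and XPath to scrape this data and am trying to work out how to retrieve it correctly. My problem is that I cannot figure out how to extract the metadata so that it looks like the following, with each value in its own string variable that I can use later when generating XML.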

var author_1: Eddard Stark
var author_2: Tywin Lannister
var author_3: Jon Snow

My current C# test bench using XPath:

            var url = "<URL TO DOCUMENT>";     
            var web = new HtmlWeb();
            var doc = web.Load(url);
            var navigator = (HtmlAgilityPack.HtmlNodeNavigator)doc.CreateNavigator();

            // Xpaths
            var authors_list = doc.DocumentNode.SelectSingleNode("//meta[@name='author']");
            var authors_FirstName = "//meta[@name='author']/following::meta[1]/@content";
            var authors_LastName = "//meta[@name='lastname']/@content";

            // Laboratory
            var listOfAuthorsXpath = "//meta[@name='author']/following::meta[1]/@content";
            var nodes = doc.DocumentNode.SelectNodes(listOfAuthorsXpath);

            // SelectNodes
            var firstName = navigator.SelectSingleNode(authors_FirstName);
            var lastName = navigator.SelectSingleNode(authors_LastName);

            // Print to screen
            Console.WriteLine(firstName.Value + " " +lastName.Value);

            //Console.WriteLine(doc.DocumentNode.InnerHtml);
            Console.ReadKey();

1 Answer:

Answer 0 (score: 1)

[Updated answer]
(Note that the markup you shared is not well-formed XML: the closing </html> tag is missing.)

With this snippet you can get the information you need:

using System;
using System.Collections.Generic;
using System.Xml;

namespace XPath
{
    class MainClass
    {
        public static void Main()
        {
            XmlDocument doc = new XmlDocument();
            doc.Load(".... your file");

            // Xpaths
            XmlNode root = doc.DocumentElement;
            XmlNodeList xmlFirstNameNodeList = root.SelectNodes("//html/head/meta[@name='firstname']");
            XmlNodeList xmlLastNameNodeList = root.SelectNodes("//html/head/meta[@name='lastname']");

            List<String> authors = new List<String>();

            for(int i=0; i<xmlFirstNameNodeList.Count; i++) {
                authors.Add(xmlFirstNameNodeList[i].Attributes["content"].Value + " " + xmlLastNameNodeList[i].Attributes["content"].Value);
            }

            Console.ReadKey();

        }
    }
}

Contents of the authors list:

authors[0] = "Eddard Stark"
authors[1] = "Tywin Lannister"
authors[2] = "Jon Snow"
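
The snippet above pairs the first-name and last-name lists by index, which works as long as both lists have the same length and order. A minimal sketch of an alternative, assuming every `<meta name="author" />` marker is followed by its own `firstname` and `lastname` siblings, starts from each marker and looks up the nearest following siblings. The class name `AuthorExtractor`, the method `ExtractAuthors`, and the inline sample markup are hypothetical and used only for illustration; as before, the input has to be well-formed XML for `XmlDocument` to load it.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class AuthorExtractor
{
    // Hypothetical helper: returns one "First Last" string per author marker.
    public static List<string> ExtractAuthors(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        var authors = new List<string>();
        foreach (XmlNode marker in doc.SelectNodes("/html/head/meta[@name='author']"))
        {
            // On a forward axis, [1] selects the nearest matching sibling.
            XmlNode first = marker.SelectSingleNode("following-sibling::meta[@name='firstname'][1]");
            XmlNode last = marker.SelectSingleNode("following-sibling::meta[@name='lastname'][1]");

            // Fall back to an empty string if a tag or its content attribute is missing.
            string firstName = first?.Attributes?["content"]?.Value ?? "";
            string lastName = last?.Attributes?["content"]?.Value ?? "";
            authors.Add((firstName + " " + lastName).Trim());
        }
        return authors;
    }

    public static void Main()
    {
        // Inline sample standing in for a real file (assumed to be well-formed).
        string sample =
            "<html><head>" +
            "<meta name='author' />" +
            "<meta name='firstname' content='Eddard' />" +
            "<meta name='lastname' content='Stark' />" +
            "<meta name='author' />" +
            "<meta name='firstname' content='Tywin' />" +
            "<meta name='lastname' content='Lannister' />" +
            "</head></html>";

        foreach (string author in ExtractAuthors(sample))
        {
            Console.WriteLine(author);
        }
    }
}
```

Because the lookup is relative to each marker, the number of authors can vary between documents without changing the code. If the real files declare a default XML namespace on `<html>`, the XPath expressions would additionally need an `XmlNamespaceManager` with a prefix bound to that namespace.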