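I have a series of HTML files in which each author's name is split into a first name and a last name in meta tags.
My current HTML structure is shown below. I want to extract each author's first and last name correctly so that I can use this data to index the HTML files. The number of authors can vary from one HTML document to the next.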
<!doctype html>
<html lang="en">
<head>
<title>Title of document</title>
<meta charset="utf-8" />
<meta name="dcterms.title" content="The science papers title" />
<meta name="author" />
<meta name="firstname" content="Eddard" />
<meta name="lastname" content="Stark" />
<meta name="author" />
<meta name="firstname" content="Tywin" />
<meta name="lastname" content="Lannister" />
<meta name="author" />
<meta name="firstname" content="Jon" />
<meta name="lastname" content="Snow" />
<meta name="dcterms.subject" content="The articles subject" />
<meta name="description" content="The articles description, abstract or introduction" />
<meta name="keywords" content="keyword1, keyword2, keyword3" />
</head>
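I have been using C# and XPath to scrape this data and am trying to figure out how to fetch it correctly. My problem is that I can't work out how to extract the metadata so that it ends up looking like this, with each string variable available later when I generate XML.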
var author_1: Eddard Stark
var author_2: Tywin Lannister
var author_3: Jon Snow
My current C# test bench using XPath:
var url = "<URL TO DOCUMENT>";
var web = new HtmlWeb();
var doc = web.Load(url);
var navigator = (HtmlAgilityPack.HtmlNodeNavigator)doc.CreateNavigator();
// Xpaths
var authors_list = doc.DocumentNode.SelectSingleNode("//meta[@name='author']");
var authors_FirstName = "//meta[@name='author']/following::meta[1]/@content";
var authors_LastName = "//meta[@name='lastname']/@content";
// Laboratory
var listOfAuthorsXpath = "//meta[@name='author']/following::meta[1]/@content";
var nodes = doc.DocumentNode.SelectNodes(listOfAuthorsXpath);
// SelectNodes
var firstName = navigator.SelectSingleNode(authors_FirstName);
var lastName = navigator.SelectSingleNode(authors_LastName);
// Print to screen
Console.WriteLine(firstName.Value + " " +lastName.Value);
//Console.WriteLine(doc.DocumentNode.InnerHtml);
Console.ReadKey();
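Answer 0 (score: 1)
[Updated answer]
(Note that the XML you shared is not valid XML: the closing </html> tag is missing.)
With this snippet you can get the information you need: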
using System;
using System.Collections.Generic;
using System.Xml;
namespace XPath
{
class MainClass
{
public static void Main()
{
XmlDocument doc = new XmlDocument();
doc.Load(".... your file");
// Document root (the <html> element)
XmlNode root = doc.DocumentElement;
// XPath queries: all first-name and last-name meta tags, in document order
XmlNodeList xmlFirstNameNodeList = root.SelectNodes("//html/head/meta[@name='firstname']");
XmlNodeList xmlLastNameNodeList = root.SelectNodes("//html/head/meta[@name='lastname']");
// Pair the two lists by index; this assumes every author has both tags
List<String> authors = new List<String>();
for(int i=0; i<xmlFirstNameNodeList.Count; i++) {
authors.Add(xmlFirstNameNodeList[i].Attributes["content"].Value + " " + xmlLastNameNodeList[i].Attributes["content"].Value);
}
Console.ReadKey();
}
}
}
Contents of the authors list:
authors[0] = "Eddard Stark"
authors[1] = "Tywin Lannister"
authors[2] = "Jon Snow"
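The snippet above pairs first and last names by list position, which only lines up when every author has both tags. A possible variation, shown here as a minimal sketch rather than a definitive implementation, starts from each `<meta name="author" />` marker and looks up the nearest following `firstname` and `lastname` siblings. The class name `AuthorExtractor`, the method `ExtractAuthors`, and the inline sample markup are assumptions made for illustration; the sample is a well-formed, trimmed-down version of the question's head section.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class AuthorExtractor
{
    // Hypothetical helper: returns "First Last" for each author marker.
    public static List<string> ExtractAuthors(XmlDocument doc)
    {
        var authors = new List<string>();

        // Every <meta name="author" /> marks the start of one author entry.
        XmlNodeList markers = doc.SelectNodes("/html/head/meta[@name='author']");
        foreach (XmlNode marker in markers)
        {
            // Nearest following siblings of this marker, in document order.
            XmlNode first = marker.SelectSingleNode(
                "following-sibling::meta[@name='firstname'][1]");
            XmlNode last = marker.SelectSingleNode(
                "following-sibling::meta[@name='lastname'][1]");

            // Missing tags or attributes become empty strings.
            string firstName = first?.Attributes?["content"]?.Value ?? "";
            string lastName = last?.Attributes?["content"]?.Value ?? "";

            authors.Add((firstName + " " + lastName).Trim());
        }
        return authors;
    }

    public static void Main()
    {
        // Assumed sample input: well-formed markup with a closing </html>.
        const string sample = @"<html><head>
<meta name=""author"" />
<meta name=""firstname"" content=""Eddard"" />
<meta name=""lastname"" content=""Stark"" />
<meta name=""author"" />
<meta name=""firstname"" content=""Tywin"" />
<meta name=""lastname"" content=""Lannister"" />
<meta name=""author"" />
<meta name=""firstname"" content=""Jon"" />
<meta name=""lastname"" content=""Snow"" />
</head></html>";

        var doc = new XmlDocument();
        doc.LoadXml(sample);

        foreach (string author in ExtractAuthors(doc))
        {
            Console.WriteLine(author);
        }
    }
}
```

One limitation to keep in mind: if an author marker has no `firstname` or `lastname` tag of its own, the `following-sibling` lookup will pick up the next author's tag instead, so this still relies on the markup being complete.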