Question

我有几千个（ASP.net - 凌乱的HTML）html生成的发票，我正在尝试解析并保存到数据库中。

基本上像：

 foreach(var htmlDoc in HtmlFolder)
 {
   foreach(var inputBox in htmlDoc)
   { 
      //Make Collection of ID and Values Insert to DB
   }
 }

来自所有其他questions我已经阅读过这类问题的最佳工具是HtmlAgilityPack，但是对于我的生活，我无法获得文档.chm文件工作。关于如何使用或不使用Agility Pack实现此目的的任何想法？

提前致谢

Answer 1

HtmlAgilityPack的新替代方案是CsQuery。请参阅此later关于其相对性能优点的问题，但其对CSS选择器的使用不容错过：

var doc = CQ.CreateDocumentFromFile(htmldoc); //load, parse the file
var fields = doc["input"]; //get input fields with CSS
var pairs = fields.Select(node => new Tuple<string, string>(node.Id, node.Value()))
       //get values

Answer 2

要使CHM正常工作，您可能需要在Windows资源管理器和uncheck the "Unblock Content" checkbox中查看属性。

当您了解Linq-to-XML或XPath时，HTML Agility Pack非常简单。

您需要了解的基础知识：

//import the HtmlAgilityPack
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();

// Load your data
// -----------------------------
// Load doc from file:
doc.Load(pathToFile);

// OR

// Load doc from string:
doc.LoadHtml(contentsOfFile);
// -----------------------------

// Find what you're after
// -----------------------------
// Finding things using Linq
var nodes = doc.DocumentNode.DescendantsAndSelf("input")
    .Where(node => !string.IsNullOrWhitespace(node.Id)
        && node.Attributes["value"] != null
        && !string.IsNullOrWhitespace(node.Attributes["value"].Value));

// OR

// Finding things using XPath
var nodes = doc.DocumentNode
    .SelectNodes("//input[not(@id='') and not(@value='')]");
// -----------------------------


// looping through the nodes:
// the XPath interfaces can return null when no nodes are found
if (nodes != null) 
{ 
    foreach (var node in nodes)
    {
        var id = node.Id;
        var value = node.Attributes["value"].Value;
    }
}

add the HtmlAgility Pack is using NuGet的最简单方法：

PM＆GT;安装包HtmlAgilityPack

Answer 3

哈，看起来是我写的一个无耻插件库的理想时间！

这个应该很容易用这个库完成（顺便说一下，它是建立在HtmlAgility包之上的！）：https://github.com/amoerie/htmlbuilders （你可以在这里找到Nuget包：https://www.nuget.org/packages/HtmlBuilders/）

代码示例：

        const string html = "<div class='invoice'><input type='text' name='abc' value='123'/><input id='ohgood' type='text' name='def' value='456'/></div>";
        var htmlDocument = new HtmlDocument {OptionCheckSyntax = false}; // avoid exceptions when html is invalid
        htmlDocument.Load(new StringReader(html));
        var tag = HtmlTag.Parse(htmlDocument); // if there is a root tag
        var tags = HtmlTag.ParseAll(htmlDocument); // if there is no root tag

        // find looks recursively through the entire DOM tree
        var inputFields = tag.Find(t => string.Equals(t.TagName, "input"));

        foreach (var inputField in inputFields)
        {
            Console.WriteLine(inputField["type"]);
            Console.WriteLine(inputField["value"]);
            if(inputField.HasAttribute("id"))
                Console.WriteLine(inputField["id"]);
        }

请注意，如果该字段没有指定的属性名称，则inputField [attribute]将抛出'KeyNotFoundException'。那是因为HtmlTag为其属性实现并重用了IDictionary逻辑。

编辑：如果您没有在Web环境中运行此代码，则需要添加对System.Web的引用。那是因为这个库使用了可以在System.Web中找到的HtmlString类。只需选择“添加参考”，然后您就可以在“Assemblies”下找到它。框架“

Answer 4

您可以从HtmlAgilityPack下载here Documents CHM文件。

如果chm文件内容不可见，则取消选中Always ask before opening this file复选框，如屏幕截图所示

注意：上面的对话框显示未签名的文件

来源：HtmlAgilityPack Documentation

解析Html文档获取具有ID和值的所有输入字段

4 个答案:

PM＆GT;安装包HtmlAgilityPack