Question

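I am using HtmlAgilityPack to parse the HTML document of a WebBrowser control. I can find the HtmlNode I want, but once I have the HtmlNode, I want to get back to the corresponding HtmlElement in WebbrowserControl.Document.

In fact, HtmlAgilityPack parses an offline copy of the live document, whereas I want to access the live elements of the WebBrowser control in order to read rendered properties such as currentStyle or runtimeStyle.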

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(webBrowser1.Document.Body.InnerHtml);
var some_nodes = doc.DocumentNode.SelectNodes("//p"); 
// this selection could be more sophisticated
// and the answer shouldn't rely on it.
foreach (HtmlNode node in some_nodes)
{
   HtmlElement live_element = CorrespondingElementFromWebBrowserControl(node);
   // CorrespondingElementFromWebBrowserControl is what I am searching for
}

If the element had some specific attribute this would probably be easy, but I want a solution that works for any element.

Please help me with this.

Answer 1

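There actually seems to be no way to change the document directly in the WebBrowser control. But you can extract the HTML from it, modify it, and write it back again: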

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(webBrowser1.DocumentText);

foreach (HtmlAgilityPack.HtmlNode node in doc.DocumentNode.ChildNodes) {
    node.Attributes.Add("TEST", "TEST");
}

StringBuilder sb = new StringBuilder();
using (StringWriter sw = new StringWriter(sb)) {
    doc.Save(sw);
    webBrowser1.DocumentText = sb.ToString();
}

For direct manipulation you could possibly use the unmanaged pointer webBrowser1.Document.DomDocument to work on the document, but that is beyond my knowledge.

Answer 2

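HtmlAgilityPack definitely cannot give you direct access to nodes in the live HTML. Since you said the elements have no distinct style/class/id, you have to walk through the nodes manually and find the matches.

Assuming the HTML is reasonably valid (so that the browser and HtmlAgilityPack normalize it in the same way), you can walk pairs of elements by starting from the roots of both trees and selecting the same child node in each.

Basically you can build a "position-based" XPath to the node in one tree and select it in the other tree. The XPath would look like one of the following (depending on whether you want to rely on positions only, or on positions and node names):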

 "/*[1]/*[4]/*[2]/*[7]"
 "/body/div[2]/span[1]/p[3]"

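Steps:

1. For the HtmlNode you found, collect all of its parent nodes up to and including the root.
2. Get the root of the HTML elements in the browser.
3. For each level of children, find the position of the corresponding child from the HtmlNode collection of step 1 within its parent, then pick the live HtmlElement at that position among the children of the current live node.
4. Move to the newly found child and go back to step 3 until you reach the node you are looking for.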
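
The steps above could be sketched in C# roughly as follows. This is a minimal, untested sketch rather than a definitive implementation: LiveElementLocator and FindByPosition are hypothetical names, and it assumes the HtmlAgilityPack document was loaded from the InnerHtml of liveRoot (webBrowser1.Document.Body in the question) and that both parsers produced the same element structure. Only element nodes are counted, because HtmlElement.Children does not expose text nodes.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Windows.Forms;   // HtmlElement (live DOM wrapper)
using HtmlAgilityPack;        // HtmlNode, HtmlNodeType (offline copy)

static class LiveElementLocator   // hypothetical helper class
{
    // Maps a parsed node to the live element at the same position in the tree.
    // liveRoot must be the live element whose InnerHtml was given to LoadHtml.
    public static HtmlElement FindByPosition(HtmlNode node, HtmlElement liveRoot)
    {
        // Steps 1 and 3: walk up to the document node, recording the index of
        // each ancestor among the element children of its parent.
        var indexes = new Stack<int>();
        for (HtmlNode current = node; current.ParentNode != null; current = current.ParentNode)
        {
            HtmlNode target = current;
            int index = current.ParentNode.ChildNodes
                .Where(n => n.NodeType == HtmlNodeType.Element)
                .TakeWhile(n => n != target)
                .Count();
            indexes.Push(index);
        }

        // Steps 2 and 4: replay the indexes from the root down on the live tree.
        HtmlElement live = liveRoot;
        foreach (int index in indexes)
        {
            if (index >= live.Children.Count)
                return null; // the two trees differ, no corresponding element
            live = live.Children[index];
        }
        return live;
    }
}

// Usage, with the names from the question:
// HtmlElement liveElement = LiveElementLocator.FindByPosition(node, webBrowser1.Document.Body);
```

Matching by position only is the simplest variant; comparing the tag name at each level as well would catch cases where the two parsers normalized the markup differently.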

Answer 3

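The XPath property of HtmlAgilityPack.HtmlNode shows the nodes on the path from the root to the node, for example /div[1]/div[2]/table[1]. You can traverse this path in the live document to find the corresponding live element. However, this path may not be accurate, because HtmlAgilityPack drops some tags such as <form>. Before using this solution, add the omitted tags back using

HtmlNode.ElementsFlags.Remove("form");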

// Structure to hold the name and position of each node in the path
struct DocNode
{
    public string Name;
    public int Pos;
}

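The following method finds the live element according to the XPath (it calls a GetChild helper that looks up a child by its name and position):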

    static public HtmlElement GetLiveElement(HtmlNode node, HtmlDocument doc)
    {
        var pattern = @"/(.*?)\[(.*?)\]"; // like div[1]
        // Parse the XPath to extract the nodes on the path
        var matches = Regex.Matches(node.XPath, pattern); 
        List<DocNode> PathToNode = new List<DocNode>();
        foreach (Match m in matches) // Make a path of nodes
        {
            DocNode n = new DocNode();
            n.Name = m.Groups[1].Value;
            n.Pos = Convert.ToInt32(m.Groups[2].Value)-1;
            PathToNode.Add(n); // add the node to path 
        }

        HtmlElement elem = null; //Traverse to the element using the path
        if (PathToNode.Count > 0)
        {
            elem = doc.Body; //begin from the body
            foreach (DocNode n in PathToNode)
            {
                //Find the corresponding child by its name and position
                elem = GetChild(elem, n);
                if (elem == null) break; // no matching child, the two trees differ
            }
        }
        return elem;
    }
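
The code of the GetChild helper called above is not shown in the answer. A minimal sketch of what it might look like is given below, assuming it should return the child whose tag name matches DocNode.Name at the zero-based position DocNode.Pos among same-named siblings, or null when there is none. The body is an assumption, not the original author's code; DocNode is the struct defined above, and the method is meant to sit in the same class as GetLiveElement.

```csharp
using System;
using System.Windows.Forms; // HtmlElement

static partial class LiveElementHelpers // hypothetical containing class
{
    // Sketch only: finds the n-th child of 'parent' with the given tag name.
    static public HtmlElement GetChild(HtmlElement parent, DocNode node)
    {
        if (parent == null)
            return null;

        int seen = 0; // number of same-named children passed so far
        foreach (HtmlElement child in parent.Children)
        {
            // TagName is typically upper case in the live DOM,
            // while HtmlAgilityPack reports lower-case names.
            if (string.Equals(child.TagName, node.Name, StringComparison.OrdinalIgnoreCase))
            {
                if (seen == node.Pos)
                    return child;
                seen++;
            }
        }
        return null; // no such child in the live document
    }
}
```

Counting only same-named siblings mirrors how XPath positions such as div[2] are defined, which is why Pos is compared against a per-name counter rather than the raw child index.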

Getting an HtmlElement based on an HtmlAgilityPack.HtmlNode
