Selecting elements added to the DOM by script

Time: 2010-08-25 23:22:09

Tags: c# asp.net html-agility-pack

I have been trying to get the <object> and <embed> tags using the following:

HtmlNode videoObjectNode = doc.DocumentNode.SelectSingleNode("//object");
HtmlNode videoEmbedNode = doc.DocumentNode.SelectSingleNode("//embed");

This does not seem to work.

Can anyone tell me how to get these tags and their InnerHtml?

A YouTube embedded video looks like this:

    <embed height="385" width="640" type="application/x-shockwave-flash" 
src="http://s.ytimg.com/yt/swf/watch-vfl184368.swf" id="movie_player" flashvars="..." 
allowscriptaccess="always" allowfullscreen="true" bgcolor="#000000">

I have a feeling the JavaScript may be stopping the SWF player lookup from working; I hope not...

Cheers

1 answer:

Answer 0: (score: 3)

Update 2010-08-26 (in response to the OP's comment)

I think you're looking at this the wrong way, Alex. Suppose I wrote some C# code that looked like this:

string codeBlock = "if (x == 1) Console.WriteLine(\"Hello, World!\");";

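Now, if I wrote a C# parser, should it recognize the contents of the string literal above as C# code and highlight it (or whatever)? No, because in the context of a well-formed C# file, that text is just the string value being assigned to the codeBlock variable.

Similarly, in the HTML on YouTube's pages, the <object> and <embed> elements are not really elements in the context of the current HTML document. They are the contents of string values that live inside JavaScript code.

In fact, even if HtmlAgilityPack ignored this and tried to recognize every piece of text that could be HTML, it still would not succeed with these elements, because inside the JavaScript they are heavily escaped with \ characters (note the shaky Unescape method in the code I posted, which works around this).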
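To make the escaping point concrete, here is a minimal sketch of what the backslash-escaped markup looks like and how a single regex replacement can undo it. The sample string and the names UnescapeSketch and fromJavaScript are made up for illustration; this is not taken from an actual YouTube page.

```csharp
using System;
using System.Text.RegularExpressions;

static class UnescapeSketch
{
    // Drops the backslash from every "\x" pair and keeps the character x.
    public static string Unescape(string escaped)
    {
        return Regex.Replace(escaped, @"\\(.)", "$1");
    }

    static void Main()
    {
        // Hypothetical fragment as it might sit inside a JavaScript string literal:
        // the quotes and the slash of the closing tag are backslash-escaped.
        string fromJavaScript = "<embed id=\\\"movie_player\\\"><\\/embed>";

        Console.WriteLine(fromJavaScript);           // escaped form
        Console.WriteLine(Unescape(fromJavaScript)); // plain HTML form
    }
}
```

Note that this only strips backslashes. It does not decode escapes such as \n or \uXXXX, so it is adequate only when the JavaScript string escapes nothing but quotes and slashes.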

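I'm not saying my hacky solution below is the right way to approach this problem; I'm just explaining why getting these elements isn't as simple as grabbing them with HtmlAgilityPack.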


YouTubeScraper

OK, Alex: you asked for it, so here it is. Some truly hacky code to pull your precious <object> and <embed> elements out of that sea of JavaScript.

using System;
using System.Net;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

class YouTubeScraper
{
    public HtmlNode FindObjectElement(string url)
    {
        HtmlNodeCollection scriptNodes = FindScriptNodes(url);

        for (int i = 0; i < scriptNodes.Count; ++i)
        {
            HtmlNode scriptNode = scriptNodes[i];

            string javascript = scriptNode.InnerHtml;

            int objectNodeLocation = javascript.IndexOf("<object");

            if (objectNodeLocation != -1)
            {
                string htmlStart = javascript.Substring(objectNodeLocation);

                int objectNodeEndLocation = htmlStart.IndexOf(">\" :");

                if (objectNodeEndLocation != -1)
                {
                    string finalEscapedHtml = htmlStart.Substring(0, objectNodeEndLocation + 1);

                    string unescaped = Unescape(finalEscapedHtml);

                    var objectDoc = new HtmlDocument();

                    objectDoc.LoadHtml(unescaped);

                    HtmlNode objectNode = objectDoc.GetElementbyId("movie_player");

                    return objectNode;
                }
            }
        }

        return null;
    }

    public HtmlNode FindEmbedElement(string url)
    {
        HtmlNodeCollection scriptNodes = FindScriptNodes(url);

        for (int i = 0; i < scriptNodes.Count; ++i)
        {
            HtmlNode scriptNode = scriptNodes[i];

            string javascript = scriptNode.InnerHtml;

            int approxEmbedNodeLocation = javascript.IndexOf("<\\/object>\" : \"<embed");

            if (approxEmbedNodeLocation != -1)
            {
                // Skip the 15 characters of <\/object>" : " so the substring starts at <embed.
                string htmlStart = javascript.Substring(approxEmbedNodeLocation + 15);

                int embedNodeEndLocation = htmlStart.IndexOf(">\";");

                if (embedNodeEndLocation != -1)
                {
                    string finalEscapedHtml = htmlStart.Substring(0, embedNodeEndLocation + 1);

                    string unescaped = Unescape(finalEscapedHtml);

                    var embedDoc = new HtmlDocument();

                    embedDoc.LoadHtml(unescaped);

                    HtmlNode videoEmbedNode = embedDoc.GetElementbyId("movie_player");

                    return videoEmbedNode;
                }
            }
        }

        return null;
    }

    protected HtmlNodeCollection FindScriptNodes(string url)
    {
        var doc = new HtmlDocument();

        WebRequest request = WebRequest.Create(url);
        using (var response = request.GetResponse())
        using (var stream = response.GetResponseStream())
        {
            doc.Load(stream);
        }

        HtmlNode root = doc.DocumentNode;
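        HtmlNodeCollection scriptNodes = root.SelectNodes("//script");

        // SelectNodes returns null (not an empty collection) when nothing matches,
        // so hand callers an empty collection they can safely loop over.
        return scriptNodes ?? new HtmlNodeCollection(root);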
    }

    static string Unescape(string htmlFromJavascript)
    {
        // The JavaScript has escaped all of its HTML using backslashes. We need
        // to reverse this.

        // DISCLAIMER: I am a TOTAL Regex n00b; I make no claims as to the robustness
        // of this code. If you could improve it, please, I beg of you to do so. Personally,
        // I tested it on a grand total of three inputs. It worked for those, at least.
        return Regex.Replace(htmlFromJavascript, @"\\(.)", UnescapeFromBeginning);
    }

    static string UnescapeFromBeginning(Match match)
    {
        string text = match.ToString();

        if (text.StartsWith("\\"))
        {
            return text.Substring(1);
        }

        return text;
    }
}
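To make the marker arithmetic in FindEmbedElement easier to follow, here is a minimal sketch that runs the same slicing on a hard-coded string, with no network access and no HtmlAgilityPack. The sample script text, the class name MarkerSliceSketch, and the method SliceEmbed are made up for illustration; the layout of the real page's JavaScript is assumed here, not known.

```csharp
using System;

static class MarkerSliceSketch
{
    // Returns the still-escaped <embed ...> text found between the two markers,
    // or null when either marker is missing.
    public static string SliceEmbed(string javascript)
    {
        const string startMarker = "<\\/object>\" : \"<embed";

        // Distance from the start of the marker to the '<' of "<embed".
        // This is the same 15 that the scraper hard-codes.
        int skip = startMarker.Length - "<embed".Length;

        int start = javascript.IndexOf(startMarker, StringComparison.Ordinal);
        if (start == -1) return null;

        string tail = javascript.Substring(start + skip);

        // The JavaScript string literal is assumed to end with >"; right after the tag.
        int end = tail.IndexOf(">\";", StringComparison.Ordinal);
        if (end == -1) return null;

        return tail.Substring(0, end + 1); // keep the closing '>'
    }

    static void Main()
    {
        // Hypothetical script text shaped like the pattern the scraper searches for.
        string js = "var h = ie ? \"<object><\\/object>\" : \"<embed id=\\\"movie_player\\\">\";";
        Console.WriteLine(SliceEmbed(js));
    }
}
```

Deriving the offset from the marker instead of writing 15 keeps the two in sync if the marker text ever has to change. The result is still escaped, so it would go through Unescape before being handed to LoadHtml.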

If you're interested, here's a little demo I threw together (super fancy, I know):

class Program
{
    static void Main(string[] args)
    {
        var scraper = new YouTubeScraper();

        HtmlNode davidAfterDentistEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=txqiwrbYGrs");
        Console.WriteLine("David After Dentist:");
        Console.WriteLine(davidAfterDentistEmbedNode.OuterHtml);
        Console.WriteLine();

        HtmlNode drunkHistoryObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=jL68NyCSi8o");
        Console.WriteLine("Drunk History:");
        Console.WriteLine(drunkHistoryObjectNode.OuterHtml);
        Console.WriteLine();

        HtmlNode jessicaDailyAffirmationEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=qR3rK0kZFkg");
        Console.WriteLine("Jessica's Daily Affirmation:");
        Console.WriteLine(jessicaDailyAffirmationEmbedNode.OuterHtml);
        Console.WriteLine();

        HtmlNode jazzerciseObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=VGOO8ZhWFR4");
        Console.WriteLine("Jazzercise - Move your Boogie Body:");
        Console.WriteLine(jazzerciseObjectNode.OuterHtml);
        Console.WriteLine();

        Console.Write("Finished! Hit Enter to quit.");
        Console.ReadLine();
    }
}

Original answer

Why not try using the element's Id instead?

HtmlNode videoEmbedNode = doc.GetElementbyId("movie_player");

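Update: Oh man, you're searching for HTML tags that are inside JavaScript? That's definitely why this isn't working. (They aren't really tags to be parsed from HtmlAgilityPack's point of view; all of that JavaScript is just one big string inside a <script> tag.) Maybe there's some way you could parse the inner text of the <script> tag itself as HTML and go from there.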
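The idea of parsing the script text as HTML in its own right could be sketched roughly as follows. This assumes the HtmlAgilityPack NuGet package is referenced and that the markup inside the script is not backslash-escaped; the class ScriptReparseSketch, the method FindInsideScript, and the sample page are hypothetical.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

static class ScriptReparseSketch
{
    // Loads the page, then loads each <script> body into a second HtmlDocument
    // and looks for an element with the given id there.
    public static HtmlNode FindInsideScript(string pageHtml, string id)
    {
        var page = new HtmlDocument();
        page.LoadHtml(pageHtml);

        // SelectNodes returns null when nothing matches.
        HtmlNodeCollection scripts = page.DocumentNode.SelectNodes("//script");
        if (scripts == null) return null;

        foreach (HtmlNode script in scripts)
        {
            var inner = new HtmlDocument();
            inner.LoadHtml(script.InnerHtml); // treat the script text as HTML

            HtmlNode hit = inner.GetElementbyId(id);
            if (hit != null) return hit;
        }

        return null;
    }

    static void Main()
    {
        // Hypothetical page: the <embed> exists only inside a JavaScript string.
        string html = "<html><body><script>var p = '<embed id=\"movie_player\" src=\"player.swf\">';</script></body></html>";

        HtmlNode node = FindInsideScript(html, "movie_player");
        Console.WriteLine(node == null ? "not found" : node.OuterHtml);
    }
}
```

On a page where the script escapes its quotes and slashes, the script text would need to be unescaped before the second LoadHtml call, which is what the YouTubeScraper code above does.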