Question

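As the title says: is there a way to detect all the URLs in a PHP page and tell whether they are relative? Keep in mind that the URLs in the page can appear in different contexts: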

<link rel="stylesheet" href="/lib/css/hanv2/ie.css" />
<img src="/image.jpg">
<div style="background-image: url(/lib/data/emotion-header-v2/int-algemeen08.jpg)"></div>

So I need to collect the relative URLs whatever their context: CSS links, JS links, image links, SWF links.

I am using AgilityPack, and here are some C# snippets I use to extract the links and check whether they are relative:

    // Extract the href value of every <link> element
    private List<string> ExtractAllAHrefTags(HtmlAgilityPack.HtmlDocument htmlSnippet)
    {
        List<string> hrefTags = new List<string>();

        // SelectNodes returns null when nothing matches
        HtmlNodeCollection links = htmlSnippet.DocumentNode.SelectNodes("//link[@href]");
        if (links == null)
        {
            return hrefTags;
        }

        foreach (HtmlNode link in links)
        {
            HtmlAttribute att = link.Attributes["href"];
            hrefTags.Add(att.Value);
        }

        return hrefTags;
    }


    // Extract the src value of every <img> element
    private List<string> ExtractAllImgTags(HtmlAgilityPack.HtmlDocument htmlSnippet)
    {
        List<string> srcTags = new List<string>();

        // SelectNodes returns null when nothing matches
        HtmlNodeCollection images = htmlSnippet.DocumentNode.SelectNodes("//img[@src]");
        if (images == null)
        {
            return srcTags;
        }

        foreach (HtmlNode img in images)
        {
            HtmlAttribute att = img.Attributes["src"];
            srcTags.Add(att.Value);
        }

        return srcTags;
    }




    // Check whether each path is relative
    foreach (string s in AllHrefTags)
    {
        // Both tests must hold: the value starts with neither http:// nor https://
        if (!s.StartsWith("http://") && !s.StartsWith("https://"))
        {
            // path is relative
        }
    }

I would like to know whether there is a better or more accurate way, with AgilityPack or otherwise, to get all the relative paths from a given HTML page.

Answer 1

You can use this XPath expression to extract the relative URLs from the HTML page, wherever they appear as href or src values:

    htmlSnippet.DocumentNode.SelectNodes("(//@src|//@href)[not(starts-with(.,'http://'))][not(starts-with(.,'https://'))]");
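
Note that the two starts-with tests above are case-sensitive and only exclude http and https, so values such as `mailto:someone@example.com`, `data:...`, or `HTTP://...` would still be reported as relative. The following is a minimal sketch of a stricter check, not part of the original answer: the class name `UrlKindSketch` and the method `IsRelativeUrl` are hypothetical, and treating protocol-relative values (`//host/path`) as non-relative is an assumption you may want to change.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlKindSketch
{
    // A URI scheme per RFC 3986: a letter, then letters, digits, '+', '-' or '.',
    // terminated by ':' (for example "http:", "mailto:", "data:").
    private static readonly Regex SchemePrefix =
        new Regex(@"^[A-Za-z][A-Za-z0-9+.\-]*:", RegexOptions.Compiled);

    // Returns true when the value carries no scheme.
    // Assumption: protocol-relative values ("//host/path") are treated as
    // NOT relative, because they can point at another host.
    public static bool IsRelativeUrl(string value)
    {
        if (string.IsNullOrWhiteSpace(value))
        {
            return false;
        }

        string url = value.Trim();

        if (url.StartsWith("//", StringComparison.Ordinal))
        {
            return false;
        }

        return !SchemePrefix.IsMatch(url);
    }
}
```

With this sketch, `UrlKindSketch.IsRelativeUrl("/lib/css/hanv2/ie.css")` returns true, while `"https://example.com/a.css"`, `"HTTP://example.com"`, and `"mailto:someone@example.com"` all return false. Fragment-only values such as `"#tips"` still count as relative here.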

You may also want to filter out links that start with #, which jump to a specific location on the current page (for example: <a href="#tips">):

    htmlSnippet.DocumentNode.SelectNodes("(//@src|//@href)[not(starts-with(.,'http://'))][not(starts-with(.,'https://'))][not(starts-with(.,'#'))]");
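
The XPath above only looks at src and href attributes, so it will not see URLs embedded in inline CSS such as the `background-image: url(...)` example from the question. One possible way to cover that case is to run a regular expression over the text of each style attribute (or `<style>` block). This is a rough sketch under assumptions: `CssUrlSketch` and `ExtractCssUrls` are hypothetical names, the caller supplies the style text, and the pattern does not handle CSS escapes or comments.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CssUrlSketch
{
    // Matches url(...) where the value is double-quoted, single-quoted, or unquoted.
    // The value is captured in the named group "u" in each alternative.
    private static readonly Regex CssUrl = new Regex(
        @"url\(\s*(?:""(?<u>[^""]*)""|'(?<u>[^']*)'|(?<u>[^'""\s)]+))\s*\)",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns every URL found inside url(...) in the given CSS text.
    public static List<string> ExtractCssUrls(string styleText)
    {
        var urls = new List<string>();
        if (string.IsNullOrEmpty(styleText))
        {
            return urls;
        }

        foreach (Match m in CssUrl.Matches(styleText))
        {
            urls.Add(m.Groups["u"].Value);
        }

        return urls;
    }
}
```

For the style value from the question, `CssUrlSketch.ExtractCssUrls("background-image: url(/lib/data/emotion-header-v2/int-algemeen08.jpg)")` yields a single entry, `/lib/data/emotion-header-v2/int-algemeen08.jpg`. The extracted values can then go through the same relative-URL test as the src and href values.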

How can I detect all relative URLs in an HTML page?
