Question

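I'm trying to build an ASP.NET page that fetches web pages and displays them correctly, rewriting every relevant HTML element so that it contains absolute URLs.

This question has been partially answered at https://stackoverflow.com/a/2719712/696638

Combining the answer above with this blog post, http://blog.abodit.com/2010/03/a-simple-web-crawler-in-c-using-htmlagilitypack/, I built the following: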

using System;
using System.Net;
using System.Text;
using HtmlAgilityPack;

public partial class Crawler : System.Web.UI.Page {
    protected void Page_Load(object sender, EventArgs e) {
        Response.Clear();

        // URL of the page to mirror, passed as ?path=...
        string url = Request.QueryString["path"];
        Uri baseUri = new Uri(url);

        string sourceHTML;
        using (WebClient client = new WebClient()) {
            // Assumes the remote page is UTF-8 encoded
            byte[] requestHTML = client.DownloadData(url);
            sourceHTML = Encoding.UTF8.GetString(requestHTML);
        }

        HtmlDocument htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(sourceHTML);

        // SelectNodes returns null, not an empty collection, when nothing matches
        HtmlNodeCollection links = htmlDoc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null) {
            foreach (HtmlNode link in links) {
                HtmlAttribute att = link.Attributes["href"];
                string href = att.Value;
                if (string.IsNullOrEmpty(href)) continue;

                // ignore javascript on buttons using a tags
                if (href.StartsWith("javascript", StringComparison.InvariantCultureIgnoreCase)) continue;

                // Skip values that cannot be parsed as a URI at all
                Uri urlNext;
                if (!Uri.TryCreate(href, UriKind.RelativeOrAbsolute, out urlNext)) continue;

                // Resolve relative hrefs against the URL of the fetched page
                if (!urlNext.IsAbsoluteUri) {
                    urlNext = new Uri(baseUri, urlNext);
                    att.Value = urlNext.ToString();
                }
            }
        }

        Response.Write(htmlDoc.DocumentNode.OuterHtml);
    }
}

This only replaces the href attribute of links. Extending this, I'd like to know the most efficient way to also cover:

href for <a>
href for <link>
src for <script>
src for <img>
action for <form>

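Are there any others that anyone can think of?

Can these be found with a single call to SelectNodes using one monster XPath, or would it be more efficient to call SelectNodes several times and iterate over each collection?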

Answer 1

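The following should work:

SelectNodes("//*[@href or @src or @action]")

Then you have to adapt the if statement below it, since the attribute to rewrite is no longer always href.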
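As a rough sketch of what that adapted loop could look like: the helper below is hypothetical (the names UrlRewriter, MakeUrlsAbsolute and UrlAttributes are not from the question or the answer) and assumes HtmlAgilityPack is referenced. It makes the single SelectNodes call from the answer and then checks each matched node for any of the three attributes, so the document is queried once rather than once per attribute.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper, for illustration only.
public static class UrlRewriter {
    // The three attributes named in the answer; extend the array
    // (and the XPath below) if more attributes need rewriting.
    private static readonly string[] UrlAttributes = { "href", "src", "action" };

    public static string MakeUrlsAbsolute(string html, Uri baseUri) {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // One query for every element carrying at least one URL attribute
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//*[@href or @src or @action]");

        // SelectNodes returns null when nothing matches
        if (nodes == null) return html;

        foreach (HtmlNode node in nodes) {
            foreach (string name in UrlAttributes) {
                HtmlAttribute att = node.Attributes[name];
                if (att == null || string.IsNullOrWhiteSpace(att.Value)) continue;

                string value = att.Value.Trim();

                // Leave javascript: pseudo-links and fragment-only links alone
                if (value.StartsWith("javascript:", StringComparison.OrdinalIgnoreCase)) continue;
                if (value[0] == '#') continue;

                // Skip unparsable values and values that are already absolute
                Uri parsed;
                if (!Uri.TryCreate(value, UriKind.RelativeOrAbsolute, out parsed)) continue;
                if (parsed.IsAbsoluteUri) continue;

                // Resolve the relative value against the page's own URL
                att.Value = new Uri(baseUri, parsed).ToString();
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

In the question's Page_Load, the link loop could then be replaced by something like Response.Write(UrlRewriter.MakeUrlsAbsolute(sourceHTML, new Uri(url))). Edge cases such as protocol-relative URLs ("//host/path") or a <base> element in the fetched page are not handled here and would need separate treatment.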
