The most appropriate way to get absolute URLs from crawled URLs

Date: 2014-08-12 19:23:39

Tags: c# wpf uri web-crawler

Suppose I have this root URL:

http://www.monstermmorpg.com

Here are several example URLs, followed by the result I want for each:

url1: http://www.monstermmorpg.com/
url2: http://www.monstermmorpg.com/Register#21312
url3: Register#21312
url4: /Register
url5: Register
url6: /Register?news=true&news2=true
// There may be more forms that resolve to the same URL, but I don't have a full list at the moment.

I need a function that, given the root URL, turns them into the following URLs:

url1: http://www.monstermmorpg.com
url2: http://www.monstermmorpg.com/Register
url3: http://www.monstermmorpg.com/Register
url4: http://www.monstermmorpg.com/Register
url5: http://www.monstermmorpg.com/Register
url6: http://www.monstermmorpg.com/Register?news=true&news2=true

There is the approach below, but I suspect it is not the best one. Is there a better way?

This is a C# .NET 4.5 WPF application.

Uri baseUri = new Uri("http://www.contoso.com");
Uri myUri = new Uri(baseUri, "catalog/shownew.htm?date=today");
Console.WriteLine(myUri.AbsoluteUri);

1 Answer:

Answer 0: (score: 1)

using System;
using System.Collections.Generic;

class Program
{
    static void Main(string[] args)
    {
        var baseUrl = "http://www.monstermmorpg.com";

        var urls = new string[] {
            "http://www.monstermmorpg.com/",
            "http://www.monstermmorpg.com/Register#21312",
            "Register#21312",
            "/Register",
            "Register",
            "/Register?news=true&news2=true" };

        var absoluteUrls = new List<string>();

        foreach (var url in urls)
        {
            Uri uri;
            if (url.StartsWith("http", StringComparison.OrdinalIgnoreCase))
            {
                // Already absolute: parse it as-is.
                uri = new Uri(url);
            }
            else
            {
                // Relative: make sure it starts with "/" and append it to the root URL.
                var urlWithSlash = url.StartsWith("/") ? url : "/" + url;
                uri = new Uri(baseUrl + urlWithSlash);
            }

            // Keep scheme, host, path and query; drop the #fragment.
            absoluteUrls.Add(uri.GetLeftPart(UriPartial.Query));
        }

        // Now absoluteUrls contains:
        // url1: http://www.monstermmorpg.com/   (Uri normalizes an empty path to "/")
        // url2: http://www.monstermmorpg.com/Register
        // url3: http://www.monstermmorpg.com/Register
        // url4: http://www.monstermmorpg.com/Register
        // url5: http://www.monstermmorpg.com/Register
        // url6: http://www.monstermmorpg.com/Register?news=true&news2=true
    }
}
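
The Uri(Uri, string) constructor shown in the question can also do the absolute/relative check on its own, so the manual StartsWith branches may not be needed. The following is only a sketch under that assumption; UrlNormalizer and ToAbsolute are hypothetical names chosen for illustration, not part of any library.

```csharp
using System;

static class UrlNormalizer
{
    // Hypothetical helper: resolves a crawled href against the root URL
    // and returns it without the #fragment.
    public static string ToAbsolute(Uri baseUri, string href)
    {
        // Uri(Uri, string) uses href alone when it is already absolute,
        // and resolves it against baseUri when it is relative.
        var resolved = new Uri(baseUri, href);

        // UriPartial.Query keeps scheme, authority, path and query.
        return resolved.GetLeftPart(UriPartial.Query);
    }
}

static class Demo
{
    static void Main()
    {
        var baseUri = new Uri("http://www.monstermmorpg.com");
        var hrefs = new[]
        {
            "http://www.monstermmorpg.com/",
            "http://www.monstermmorpg.com/Register#21312",
            "Register#21312",
            "/Register",
            "Register",
            "/Register?news=true&news2=true"
        };

        // Print the normalized absolute form of each crawled href.
        foreach (var href in hrefs)
            Console.WriteLine(UrlNormalizer.ToAbsolute(baseUri, href));
    }
}
```

One difference from the string-concatenation version: when the base URI has a path of its own (for example the page the link was found on), a relative href such as Register is resolved against that path rather than against the site root. For a crawler that is usually the desired behavior, but it depends on which URI is passed in as the base.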