I'm mirroring some internal websites for backup purposes. Right now I basically use this C# code:
System.Net.WebClient client = new System.Net.WebClient();
byte[] dl = client.DownloadData(url);
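This just downloads the HTML into a byte array, which is what I want. The problem is that the links inside the HTML are relative most of the time, not absolute.
I basically want to prepend the full http://domain.is to each relative link, turning it into an absolute link that points back to the original content. I'm only concerned with href= and src=. Is there a regular expression that will cover the basic cases?
Edit [my attempt]: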
public static string RelativeToAbsoluteURLS(string text, string absoluteUrl)
{
if (String.IsNullOrEmpty(text))
{
return text;
}
String value = Regex.Replace(
text,
"<(.*?)(src|href)=\"(?!http)(.*?)\"(.*?)>",
"<$1$2=\"" + absoluteUrl + "$3\"$4>",
RegexOptions.IgnoreCase | RegexOptions.Multiline);
return value.Replace(absoluteUrl + "/", absoluteUrl);
}
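Answer 0 (score: 9)
The most robust solution is to use the HTMLAgilityPack, as others have suggested. However, a reasonable regular-expression solution is possible using the Regex.Replace overload that takes a MatchEvaluator delegate, as follows: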
var baseUri = new Uri("http://test.com");
var pattern = @"(?<name>src|href)=""(?<value>/[^""]*)""";
var matchEvaluator = new MatchEvaluator(
match =>
{
var value = match.Groups["value"].Value;
Uri uri;
if (Uri.TryCreate(baseUri, value, out uri))
{
var name = match.Groups["name"].Value;
return string.Format("{0}=\"{1}\"", name, uri.AbsoluteUri);
}
// Not a usable URI: keep the original attribute text instead of deleting it.
return match.Value;
});
var adjustedHtml = Regex.Replace(originalHtml, pattern, matchEvaluator);
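The example above searches for attributes named src and href whose double-quoted values start with a forward slash. For each match, the static Uri.TryCreate method is used to determine whether the value is a valid relative URI.
Note that this solution does not handle single-quoted attribute values, and it certainly does not work on malformed HTML with unquoted values.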
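As a rough sketch of how the single-quote limitation noted above could be lifted, the pattern below accepts either quote character and uses a backreference so the closing quote must match the opening one. The class name `LinkRewriter` and method name `MakeAbsolute` are hypothetical, and unquoted values are still not handled.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class LinkRewriter
{
    // src/href, then a value in double OR single quotes.
    // (?<![\w-]) keeps attributes such as data-src from matching.
    // \k<q> requires the closing quote to be the same as the opening one.
    private static readonly Regex AttrPattern = new Regex(
        @"(?<![\w-])(?<name>src|href)\s*=\s*(?<q>[""'])(?<value>.*?)\k<q>",
        RegexOptions.IgnoreCase);

    public static string MakeAbsolute(string html, Uri baseUri)
    {
        return AttrPattern.Replace(html, match =>
        {
            Uri resolved;
            // Resolves relative references against baseUri; values that are
            // already absolute are kept (possibly normalized by Uri).
            if (Uri.TryCreate(baseUri, match.Groups["value"].Value, out resolved))
            {
                string q = match.Groups["q"].Value;
                return match.Groups["name"].Value + "=" + q + resolved.AbsoluteUri + q;
            }
            return match.Value; // leave anything unparseable untouched
        });
    }
}
```

For example, `LinkRewriter.MakeAbsolute("<a href='/a/b'>x</a>", new Uri("http://domain.is/dir/page.html"))` should give `<a href='http://domain.is/a/b'>x</a>`. Because the base URI is the page address rather than just the host, values without a leading slash resolve relative to the page's directory.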
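Answer 1 (score: 5)
You should load the HTML with the HtmlAgility pack, use it to access all the hrefs, and then use the Uri class to convert from relative to absolute as necessary.
See for example http://blog.abodit.com/2010/03/a-simple-web-crawler-in-c-using-htmlagilitypack/
Answer 2 (score: 5)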
Uri WebsiteImAt = new Uri(
"http://www.w3schools.com/media/media_mimeref.asp?q=1&s=2,2#a");
string href = new Uri(WebsiteImAt, "/something/somethingelse/filename.asp")
.AbsoluteUri;
string href2 = new Uri(WebsiteImAt, "something.asp").AbsoluteUri;
string href3 = new Uri(WebsiteImAt, "something").AbsoluteUri;
With your Regex-based approach, this could probably (untested) be mapped to:
String value = Regex.Replace(text, "<(.*?)(src|href)=\"(?!http)(.*?)\"(.*?)>", match =>
"<" + match.Groups[1].Value + match.Groups[2].Value + "=\""
+ new Uri(WebsiteImAt, match.Groups[3].Value).AbsoluteUri + "\""
+ match.Groups[4].Value + ">",RegexOptions.IgnoreCase | RegexOptions.Multiline);
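I should also advise against using Regex here. Instead, apply the Uri trick in code that works on a DOM, perhaps XmlDocument (if it is XHTML) or the HTML Agility Pack (otherwise), looking at all //@src or //@href attributes.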
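A minimal sketch of the XmlDocument variant, assuming the input is well-formed XHTML with no DOCTYPE or named entities that would require loading a DTD. The class name `XhtmlLinkRewriter` is hypothetical; real-world HTML will usually fail to parse this way, which is where the HTML Agility Pack comes in.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml;

// Hypothetical helper illustrating the //@src and //@href idea.
public static class XhtmlLinkRewriter
{
    public static string MakeAbsolute(string xhtml, Uri baseUri)
    {
        var doc = new XmlDocument { PreserveWhitespace = true };
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        // Unprefixed attributes are in no namespace, so this XPath also works
        // when the document declares the XHTML default namespace.
        // Copy the matches to a list before modifying the document.
        List<XmlAttribute> attributes = doc.SelectNodes("//@src | //@href")
            .Cast<XmlAttribute>()
            .ToList();

        foreach (XmlAttribute attribute in attributes)
        {
            Uri resolved;
            if (Uri.TryCreate(baseUri, attribute.Value, out resolved))
            {
                attribute.Value = resolved.AbsoluteUri;
            }
        }
        return doc.OuterXml;
    }
}
```

Working on the parsed tree means quoting style, attribute order, and case no longer matter, which is the main reason to prefer it over the regular expression.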
Answer 3 (score: 1)
While this may not be the most robust solution, it should get the job done.
var host = "http://domain.is";
var someHtml = @"
<a href=""/some/relative"">Relative</a>
<img src=""/some/relative"" />
<a href=""http://domain.is/some/absolute"">Absolute</a>
<img src=""http://domain.is/some/absolute"" />
";
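// First strip the host from links that are already absolute...
someHtml = someHtml.Replace("src=\"" + host, "src=\"");
someHtml = someHtml.Replace("href=\"" + host, "href=\"");
// ...then prefix every link with the host.
someHtml = someHtml.Replace("src=\"", "src=\"" + host);
someHtml = someHtml.Replace("href=\"", "href=\"" + host);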
Answer 4 (score: 1)
You can use the HTMLAgilityPack to accomplish this. You would do something along these (untested) lines:
Here are some examples:
Relative to absolute paths in HTML (asp.net)
http://htmlagilitypack.codeplex.com/wikipage?title=Examples&referringTitle=Home
http://blog.abodit.com/2010/03/a-simple-web-crawler-in-c-using-htmlagilitypack/
Answer 5 (score: 0)
I assume url is of type string. Use a Uri instead, built from a base URI that points to your domain:
Uri baseUri = new Uri("http://domain.is");
Uri myUri = new Uri(baseUri, url);
System.Net.WebClient client = new System.Net.WebClient();
byte[] dl = client.DownloadData(myUri);
Answer 6 (score: 0)
Just use this function:
' Converts a relative URL to an absolute URI
Function RelativeToAbsoluteUrl(ByVal baseURI As Uri, ByVal RelativeUrl As String) As Uri
' get action tags, relative or absolute
Dim uriReturn As Uri = New Uri(RelativeUrl, UriKind.RelativeOrAbsolute)
' Make it absolute if it's relative
If Not uriReturn.IsAbsoluteUri Then
Dim baseUrl As Uri = baseURI
uriReturn = New Uri(baseUrl, uriReturn)
End If
Return uriReturn
End Function
Answer 7 (score: 0)
A simple function:
public string ConvertRelativeUrlToAbsoluteUrl(string relativeUrl)
{
if (Request.IsSecureConnection)
return string.Format("https://{0}{1}", Request.Url.Host, Page.ResolveUrl(relativeUrl));
else
return string.Format("http://{0}{1}", Request.Url.Host, Page.ResolveUrl(relativeUrl));
}
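Answer 8 (score: 0)
I know this is an older question, but I figured I'd take a stab at it with a fairly simple regular expression. This works well for me. It handles http/https as well as root-relative and current-directory-relative URLs.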
var host = "http://www.google.com/";
var baseUrl = host + "images/";
var html = "<html><head></head><body><img src=\"/images/srpr/logo3w.png\" /><br /><img src=\"srpr/logo3w.png\" /></body></html>";
var regex = "(?<=(?:href|src)=\")(?!https?://)(?<url>[^\"]+)";
html = Regex.Replace(
html,
regex,
match => match.Groups["url"].Value.StartsWith("/")
? host + match.Groups["url"].Value.Substring(1)
: baseUrl + match.Groups["url"].Value);
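The string concatenation above covers the two cases mentioned, but a protocol-relative reference such as `//cdn.example.com/a.js` would end up glued onto the host, and `../` segments are left in the output as written. As a hedged sketch, the lambda could delegate to the Uri class instead; the class name `UriResolutionDemo`, the `Resolve` helper, and the `cdn.example.com` address are hypothetical.

```csharp
using System;

// Hypothetical helper: resolves a reference found in a page against that page's URL.
public static class UriResolutionDemo
{
    public static string Resolve(string pageUrl, string reference)
    {
        // The Uri(Uri, string) constructor resolves root-relative,
        // directory-relative, "../" and protocol-relative references.
        return new Uri(new Uri(pageUrl), reference).AbsoluteUri;
    }
}
```

With `http://www.google.com/images/` as the page URL, `Resolve` should map `/images/srpr/logo3w.png` and `srpr/logo3w.png` to `http://www.google.com/images/srpr/logo3w.png`, `../srpr/logo3w.png` to `http://www.google.com/srpr/logo3w.png`, and `//cdn.example.com/a.js` to `http://cdn.example.com/a.js`.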
Answer 9 (score: 0)
This is what you are looking for. This code snippet converts all relative URLs in any HTML code to absolute ones:
Private Function ConvertALLrelativeLinksToAbsoluteUri(ByVal html As String, ByVal PageURL As String) As String
Dim result As String = Nothing
' Getting all Href
Dim opt As New RegexOptions
Dim XpHref As New Regex("(href="".*?"")", RegexOptions.IgnoreCase)
Dim i As Integer
Dim NewSTR As String = html
For i = 0 To XpHref.Matches(html).Count - 1
Application.DoEvents()
Dim Oldurl As String = Nothing
Dim OldHREF As String = Nothing
Dim MainURL As New Uri(PageURL)
OldHREF = XpHref.Matches(html).Item(i).Value
Oldurl = OldHREF.Replace("href=", "").Replace("HREF=", "").Replace("""", "")
Dim NEWURL As New Uri(MainURL, Oldurl)
Dim NewHREF As String = "href=""" & NEWURL.AbsoluteUri & """"
NewSTR = NewSTR.Replace(OldHREF, NewHREF)
Next
html = NewSTR
Dim XpSRC As New Regex("(src="".*?"")", RegexOptions.IgnoreCase)
For i = 0 To XpSRC.Matches(html).Count - 1
Application.DoEvents()
Dim Oldurl As String = Nothing
Dim OldHREF As String = Nothing
Dim MainURL As New Uri(PageURL)
OldHREF = XpSRC.Matches(html).Item(i).Value
Oldurl = OldHREF.Replace("src=", "").Replace("SRC=", "").Replace("""", "")
Dim NEWURL As New Uri(MainURL, Oldurl)
Dim NewHREF As String = "src=""" & NEWURL.AbsoluteUri & """"
NewSTR = NewSTR.Replace(OldHREF, NewHREF)
Next
Return NewSTR
End Function