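I'm currently working with a CSV that has a designated website URL column.
I use that URL to pull data from a specific endpoint.
My current approach is to parse the host out of the URI and send it to my endpoint, like this: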
    public static Uri GetURI(string s)
    {
        return new UriBuilder(s).Uri;
    }

    if (websiteLoc > -1)
    {
        if (!String.IsNullOrEmpty(row[websiteLoc]))
        {
            Uri uri = GetURI(row[websiteLoc]);
            record.Add("website", uri.Host);
        }
        else
        {
            record.Add("website", "");
        }
    }
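However, because of human error by the people who entered them, the URLs may be malformed, badly structured, or mistyped.
I'm running into extremely badly formatted URLs in the column, for example: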
htpp://www.sessler.cm.utexas.edu/
http //www.fedex.com
http.redhead-int.com
http:://www.741limo.com/
In these cases my code either parses the wrong host or throws an error. Is there a better way to try to parse these awful URLs correctly?
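To see which rows fail outright, it can help to wrap the lookup so that one bad value cannot abort the whole import. This is only a minimal sketch: `HostProbe` and `TryGetHost` are hypothetical names that are not part of the code above, and the helper simply reports failure when `UriBuilder` rejects the input.

```csharp
using System;

public static class HostProbe
{
    // Hypothetical helper (not in the original code): returns true and the
    // parsed host, or false when the input is empty or cannot be parsed.
    public static bool TryGetHost(string s, out string host)
    {
        host = "";
        if (string.IsNullOrWhiteSpace(s))
        {
            return false;
        }
        try
        {
            host = new UriBuilder(s.Trim()).Uri.Host;
            return true;
        }
        catch (UriFormatException)
        {
            // Thrown when the text cannot be turned into a valid URI.
            return false;
        }
    }
}
```

Rows for which this returns false can be logged and sent to a cleanup step. A successful parse does not prove the host is right: an input such as http.redhead-int.com may well parse with the stray `http.` kept as part of the host.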
Answer 0 (score: 0)
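If you rely on Uri.IsWellFormedUriString(), you will get a lot of false negatives/positives, but...
You can also use a combination of regex and/or string comparisons to try to increase the hit rate of valid URLs. Note that I use character comparisons in a few places where a regex would be better; I am trying to show that you can use either regex or direct comparisons.
A stricter regex and/or additional logic could produce better results. This will not give you a 100% hit rate, but it will clean up many poorly typed URLs. Note that regex requires a reference to System.Text.RegularExpressions: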
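    //at the top of the file
    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    private List<int> acceptableChars = new List<int>();

    private void createListOfAcceptableCharacters()
    {
        //fill the field itself; declaring a local list with the same name here
        //would shadow the field and leave it empty
        acceptableChars.Clear();
        acceptableChars.Add(45);// -
        acceptableChars.Add(46);// .
        acceptableChars.Add(47);// /
        for (int a = 97; a < 123; a++)
        {// a through z
            acceptableChars.Add(a);
        }
        for (int a = 48; a < 58; a++)
        {// 0 through 9
            acceptableChars.Add(a);
        }
    }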
    public string parseURL(string input)
    {
        //you would only do this once in reality
        createListOfAcceptableCharacters();
        //basic cleanup
        input = input.Trim().ToLower();
        //Regex.Replace would be more elegant here but string.Replace works, too
        input = input.Replace(".cm", ".com").Replace("htpp", "http").Replace("htp", "http").Replace("http.", "http:").Replace("//ww.", "//www.").Replace(":/", "://").Replace(":////", "://").Replace(":///", "://").Replace(" ", "");
        //check to see if URL is generally valid as-is
        if (isValidURL(input))
        {
            return input;
        }
        //try to salvage a poorly formed URL
        //StartsWith avoids the exception Substring(0, 5) throws on inputs shorter than 5 characters
        bool isSecure = input.StartsWith("https", StringComparison.Ordinal);
        input = input.Replace(" ", "").Replace(":", "").Replace("https", "").Replace("http//", "").Replace("http/", "").Replace("http", "").Replace("//", "").Replace("www", "").Replace("ww", "").Replace("cm", "com");//again, Regex.Replace would be more elegant
        //clear front end to first period if it exists before position 7
        int period = input.IndexOf('.');
        if (period > -1 && period < 7)
        {
            input = input.Substring(period + 1);
        }
        //drop a trailing slash
        if (input.EndsWith("/", StringComparison.Ordinal))
        {
            input = input.Substring(0, input.Length - 1);
        }
        //get the extension (only three-letter extensions are handled)
        string extension = "";
        if (input.Length >= 4 && input[input.Length - 4] == '.')
        {
            //extension is where we expect
            extension = input.Substring(input.Length - 3, 3);
            //strip only the trailing extension, not every occurrence of it
            input = input.Substring(0, input.Length - 4);
        }
        else
        {
            //cannot find extension - can't process
            return "badURL";
        }
        string url = "";
        //move backwards through the path, collecting only acceptable characters (note, this can be done with regex as well)
        for (int i = input.Length - 1; i > -1; i--)
        {
            char thisChar = input[i];
            if (thisChar == ':')
            {
                return "http://" + url + "." + extension;
            }
            //compare current char to list of acceptable chars
            if (acceptableChars.Contains((int)thisChar))
            {
                url = thisChar + url;
            }
        }
        //final cleanup and full path formation
        if (url.StartsWith(".", StringComparison.Ordinal))
        {
            url = url.Substring(1);
        }
        if (url.Length == 0)
        {
            return "badURL";
        }
        //pick the scheme first: without this, + binds tighter than ?: and the
        //secure branch would yield only "https://"
        string scheme = isSecure ? "https://" : "http://";
        url = scheme + url + "." + extension;
        //optional alternative: force a www. prefix instead of the line above
        //url = scheme + "www." + url + "." + extension;
        url = url.Replace("::", ":");
        //test salvaged url. If reasonable, return it, else return badURL
        if (isValidURL(url))
        {
            return url;
        }
        else
        {
            return "badURL";
        }
    }
    private bool isValidURL(string url)
    {
        return Regex.IsMatch(url, @"^((http[s]?:[/][/])?(\w+[\-.])+com|((http[s]?:[/][/])?(\w+[\-.])+com[/]|[.][/])?\w+([/]\w+)*([/]|[.]html|[.]php|[.]gif|[.]jpg|[.]png)?)$");
    }
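As mentioned in the comments, Regex.Replace can stand in for the chain of string replacements on the scheme. The following is only a sketch of that idea, not a drop-in replacement for parseURL: `SchemeRepair`, `NormalizeScheme`, and `TryGetHost` are hypothetical names, and the pattern assumes the typo is some spelling of "http" made of repeated h/t/p letters followed by at least one separator.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SchemeRepair
{
    // A mistyped scheme at the start of the string: h..t..p.., an optional
    // "s", then one or more separators (colon, dot, slash, or whitespace).
    // Requiring a separator keeps hosts such as "httpbin.org" intact.
    private static readonly Regex BrokenScheme = new Regex(
        @"^\s*h+t+p+(s?)[:./\s]+",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Rewrites the scheme to http:// or https://, or prepends http://
    // when no scheme-like prefix is found.
    public static string NormalizeScheme(string input)
    {
        string trimmed = input.Trim();
        if (BrokenScheme.IsMatch(trimmed))
        {
            return BrokenScheme.Replace(trimmed, "http$1://", 1);
        }
        return "http://" + trimmed;
    }

    // Returns true and the host when the repaired text parses as an absolute URI.
    public static bool TryGetHost(string input, out string host)
    {
        host = "";
        if (string.IsNullOrWhiteSpace(input))
        {
            return false;
        }
        Uri uri;
        if (!Uri.TryCreate(NormalizeScheme(input), UriKind.Absolute, out uri))
        {
            return false;
        }
        host = uri.Host;
        return true;
    }
}
```

With the four sample inputs from the question, `SchemeRepair.TryGetHost` yields www.sessler.cm.utexas.edu, www.fedex.com, redhead-int.com, and www.741limo.com. It only repairs the scheme and deliberately leaves .cm alone, since .cm is also a real country-code domain; correcting top-level domains would need separate logic.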
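I would suggest modifying parseURL to return the "fixed" URL instead of "badURL": sometimes it repairs a URL into one that is actually valid but that the regex wrongly rejects. Rather than handling a "badURL" return value, you can try to access the URL and handle any errors there.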
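Trying the URL could look roughly like the sketch below. It assumes a HEAD request is enough to tell whether the site answers and uses an arbitrary 10 second timeout; `UrlProbe` and `IsReachableAsync` are hypothetical names, not part of the code above.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class UrlProbe
{
    // One shared client; the timeout value is an arbitrary choice.
    private static readonly HttpClient Client =
        new HttpClient { Timeout = TimeSpan.FromSeconds(10) };

    // Returns true when the server answers a HEAD request with a success status.
    public static async Task<bool> IsReachableAsync(string url)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri)
            || (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps))
        {
            // Not an absolute http(s) URL, so no request is sent.
            return false;
        }
        try
        {
            using (var request = new HttpRequestMessage(HttpMethod.Head, uri))
            using (var response = await Client.SendAsync(request))
            {
                return response.IsSuccessStatusCode;
            }
        }
        catch (HttpRequestException)
        {
            // Network-level failure such as an unresolvable host or refused connection.
            return false;
        }
        catch (TaskCanceledException)
        {
            // The request timed out.
            return false;
        }
    }
}
```

Some servers reject HEAD requests, so a false result here is a hint rather than proof; falling back to a GET request is one option in that case.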
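Please note that this is in no way meant to be a final, working solution. You will have to test it with many inputs, modify it, and clean up/optimize the code accordingly. It is only meant to demonstrate how string comparisons and regex can be used to try to clean up URLs.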
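For that kind of testing, a small table-driven check makes it easy to rerun every sample after each change. This is a sketch under assumptions: `CleanerHarness` and `FindFailures` are hypothetical names, the cleaner is passed in as a delegate so any cleanup method can be plugged in, and the expected values are whatever you decide the right answer is for each input.

```csharp
using System;
using System.Collections.Generic;

public static class CleanerHarness
{
    // Runs every sample input through the supplied cleaner and returns a
    // description of each input whose result differs from the expected value.
    public static List<string> FindFailures(
        Func<string, string> cleaner,
        IDictionary<string, string> expectedByInput)
    {
        var failures = new List<string>();
        foreach (KeyValuePair<string, string> sample in expectedByInput)
        {
            string actual;
            try
            {
                actual = cleaner(sample.Key);
            }
            catch (Exception ex)
            {
                // Record the exception type instead of aborting the run.
                actual = "threw " + ex.GetType().Name;
            }
            if (!string.Equals(actual, sample.Value, StringComparison.Ordinal))
            {
                failures.Add(sample.Key + " -> " + actual + " (expected " + sample.Value + ")");
            }
        }
        return failures;
    }
}
```

A call such as `CleanerHarness.FindFailures(parseURL, samples)` returns an empty list once every sample cleans up as expected, and inputs that make the cleaner throw show up in the list instead of stopping the run.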