这是我分割网址的代码,但该代码存在问题。所有链接都显示为双字,例如 www.utem.edu.my/portal/portal 。单词/ portal / portal在任何链接中总是加倍。有什么建议我在网页上提取链接吗?
public String crawlURL(String strUrl) {
String results = ""; // For return
String protocol = "http://";
// Assigns the input to the inURL variable and checks to add http
String inURL = strUrl;
if (!inURL.toLowerCase().contains("http://".toLowerCase()) &&
!inURL.toLowerCase().contains("https://".toLowerCase())) {
inURL = protocol + inURL;
}
// Pulls URL contents from the web
String contectURL = pullURL(inURL);
if (contectURL == "") { // If it fails, then try with https
protocol = "https://";
inURL = protocol + inURL.split("http://")[1];
contectURL = pullURL(inURL);
}
// Declares some variables to be used inside loop
String aTagAttr = "";
String href = "";
String msg = "";
// Finds A tag and stores its href value into output var
String bodyTag = contectURL.split("<body")[1]; // Find 1st <body>
String[] aTags = bodyTag.split(">"); // Splits on every tag
//To show link different from one another
int index = 0;
for (String s: aTags) {
// Process only if it is A tag and contains href
if (s.toLowerCase().contains("<a") && s.toLowerCase().contains("href")) {
aTagAttr = s.split("href")[1]; // Split on href
// Split on space if it contains it
if (aTagAttr.toLowerCase().contains("\\s"))
aTagAttr = aTagAttr.split("\\s")[2];
// Splits on the link and deals with " or ' quotes
href = aTagAttr.split(((aTagAttr.toLowerCase().contains("\""))? "\"" : "\'"))[1];
if (!results.toLowerCase().contains(href))
//results += "~~~ " + href + "\r\n";
/*
* Last touches to URl before display
* Adds http(s):// if not exist
* Adds base url if not exist
*/
if(results.toLowerCase().indexOf("about") != -1) {
//Contains 'about'
}
if (!href.toLowerCase().contains("http://") && !href.toLowerCase().contains("https://")) {
// http:// + baseURL + href
if (!href.toLowerCase().contains(inURL.split("://")[1]))
href = protocol + inURL.split("://")[1] + href;
else
href = protocol + href;
}
System.out.println(href);//debug
答案 0 :(得分:4)
考虑使用URL类...
按照文档的建议使用它: )
public static void main(String[] args) throws Exception {
URL aURL = new URL("http://example.com:80/docs/books/tutorial"
+ "/index.html?name=networking#DOWNLOADING");
System.out.println("protocol = " + aURL.getProtocol());
System.out.println("authority = " + aURL.getAuthority());
System.out.println("host = " + aURL.getHost());
System.out.println("port = " + aURL.getPort());
System.out.println("path = " + aURL.getPath());
System.out.println("query = " + aURL.getQuery());
System.out.println("filename = " + aURL.getFile());
System.out.println("ref = " + aURL.getRef());
}
}
输出:
protocol = http
authority = example.com:80
host = example.com
port = 80
等
在此之后,您可以获取所需的元素,创建一个新的字符串/ URL:)