I am given a String containing any valid URL. I have to find the name of the website from that URL, ignoring any subdomains.
For example:
http://www.yahoo.com => yahoo
www.google.co.in => google
http://in.com => in
http://india.gov.in/ => india
https://in.yahoo.com/ => yahoo
http://philotheoristic.tumblr.com/ => tumblr
https://in.movies.yahoo.com/ => yahoo
How can I do this?
Answer 0: (score: 2)
A regular expression can help:
String str = "www.google.co.in";
String [] res = str.split("(\\.|//)+(?=\\w)");
System.out.println(res[1]);
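A regular expression is a way of describing a set of strings; the set consists of every string that matches the expression. In the code above, the string passed to split is a regular expression that matches one or more "." or "//" separators followed by a word character (a letter, digit, or underscore).
These "." and "//" substrings are therefore the delimiters used to cut the string into parts, and the site name is the part that follows the first delimiter.
"www.google.co.in" is split into: www, google, co, in. Since the solution takes the second element of the split array (index 1), the result is: google.
Note that this only works when exactly one part precedes the name. "http://in.com" is split into http:, in, com, so res[1] is in, but "http://www.yahoo.com" is split into http:, www, yahoo, com, so res[1] is www.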
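To see what the split actually produces, here is a small sketch that runs the same pattern over a few of the inputs from the question. The class name SplitDemo and the helper parts are made up for illustration; only the regular expression comes from the answer.

```java
import java.util.Arrays;

final class SplitDemo {
    // Same pattern as in the answer: one or more "." or "//" separators
    // that are followed by a word character.
    static String[] parts(String url) {
        return url.split("(\\.|//)+(?=\\w)");
    }

    public static void main(String[] args) {
        String[] samples = {"www.google.co.in", "http://in.com", "http://www.yahoo.com"};
        for (String s : samples) {
            String[] p = parts(s);
            // p[1] is the element the answer prints as the site name
            System.out.println(s + " -> " + Arrays.toString(p) + ", res[1] = " + p[1]);
        }
    }
}
```

For the last sample the element at index 1 is the "www" subdomain rather than the site name, so a fixed index is only reliable when the shape of the input is known in advance.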
Answer 1: (score: 2)
You can use URL.
From the documentation - http://docs.oracle.com/javase/tutorial/networking/urls/urlInfo.html
import java.net.*;
import java.io.*;
public class ParseURL {
public static void main(String[] args) throws MalformedURLException {
URL aURL = new URL("http://example.com:80/docs/books/tutorial"
+ "/index.html?name=networking#DOWNLOADING");
System.out.println("protocol = " + aURL.getProtocol());
System.out.println("authority = " + aURL.getAuthority());
System.out.println("host = " + aURL.getHost());
System.out.println("port = " + aURL.getPort());
System.out.println("path = " + aURL.getPath());
System.out.println("query = " + aURL.getQuery());
System.out.println("filename = " + aURL.getFile());
System.out.println("ref = " + aURL.getRef());
}
}
Here is the output displayed by the program:
protocol = http
authority = example.com:80
host = example.com // name of website
port = 80
path = /docs/books/tutorial/index.html
query = name=networking
filename = /docs/books/tutorial/index.html?name=networking
ref = DOWNLOADING
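So aURL.getHost() gives you the host name of the website. To ignore subdomains, you can split the host on ".". Because String.split takes a regular expression, the dot has to be escaped: aURL.getHost().split("\\.") returns the labels of the host, whereas split(".") treats every character as a delimiter and returns an empty array.
Note that index [0] is the name only when the host has no subdomain (in.com). For www.yahoo.com it is www, so the name has to be picked relative to the end of the array, just before the top-level suffix.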
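A minimal sketch of that idea, picking the label relative to the end of the host. The class SiteName, the method siteName, and the SECOND_LEVEL set are hypothetical names introduced here, and the set is a deliberately incomplete guess at second-level registry labels such as "co" in "co.in". A complete solution would need the Public Suffix List (for example through Guava's InternetDomainName), which this sketch does not use.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class SiteName {
    // Assumption: a small, incomplete list of labels that act as part of the
    // suffix when they appear right before the country code (co.in, gov.in, com.au).
    private static final Set<String> SECOND_LEVEL =
            new HashSet<>(Arrays.asList("co", "com", "gov", "org", "net", "ac", "edu"));

    static String siteName(String input) {
        String s = input.trim();
        // Inputs like "www.google.co.in" have no scheme, so URI would not
        // report a host for them; prepend one.
        if (!s.contains("://")) {
            s = "http://" + s;
        }
        String host = URI.create(s).getHost();
        if (host == null) {
            return null; // no usable host in the input
        }
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        // Start at the label just before the top-level domain.
        int idx = labels.length - 2;
        // Step back once more for suffixes such as "co.in" or "gov.in".
        if (labels.length >= 3 && SECOND_LEVEL.contains(labels[idx])) {
            idx--;
        }
        return idx >= 0 ? labels[idx] : labels[0];
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.yahoo.com", "www.google.co.in", "http://in.com",
            "http://india.gov.in/", "https://in.yahoo.com/",
            "http://philotheoristic.tumblr.com/", "https://in.movies.yahoo.com/"
        };
        for (String s : samples) {
            System.out.println(s + " => " + siteName(s));
        }
    }
}
```

With the sample inputs from the question this heuristic returns yahoo, google, in, india, yahoo, tumblr and yahoo. It will still misjudge hosts whose suffix is not covered by the hard-coded set.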
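Answer 2: (score: 0)
There is no possible way to find a valid website name from a URL in every case. However, if you just want to cut out a specific part of the URL string, you can do that with string manipulation, as shown below: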
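// Sketch: url is assumed to be a bare host such as "www.google.co.in"
if (url.endsWith(".co.in")) {
    int end = url.length() - ".co.in".length();      // index where ".co.in" starts
    int start = url.lastIndexOf('.', end - 1) + 1;   // just after the preceding dot, or 0 if none
    website = url.substring(start, end);
}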
Answer 3: (score: 0)
I found something similar, although it is a bit different.
http://www.yahoo.com => Yahoo
http://www.google.co.in => Google
http://in.com => In.com Offers Videos, News, Photos, Celebs, Live TV Channels.....
http://india.gov.in/ => National Portal of India
https://in.yahoo.com/ => Yahoo India
http://philotheoristic.tumblr.com/ => Philotheoristic
https://in.movies.yahoo.com/ => Yahoo India Movies - Bollywood News, Movie Reviews & Hindi Movie Videos
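Here is the code:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;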
public class TitleExtractor {
/* the CASE_INSENSITIVE flag accounts for
* sites that use uppercase title tags.
* the DOTALL flag accounts for sites that have
* line feeds in the title text */
private static final Pattern TITLE_TAG =
Pattern.compile("\\<title>(.*?)\\</title>", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);
/**
* @param url the HTML page
* @return title text (null if document isn't HTML or lacks a title tag)
* @throws IOException
*/
public static String getPageTitle(String url) throws IOException {
URL u = new URL(url);
URLConnection conn = u.openConnection();
// ContentType is an inner class defined below
ContentType contentType = getContentTypeHeader(conn);
if (contentType == null || !contentType.contentType.equals("text/html"))
return null; // don't continue if not HTML
else {
// determine the charset, or use the default
Charset charset = getCharset(contentType);
if (charset == null)
charset = Charset.defaultCharset();
// read the response body, using BufferedReader for performance
InputStream in = conn.getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
int n = 0, totalRead = 0;
char[] buf = new char[1024];
StringBuilder content = new StringBuilder();
// read until EOF or first 8192 characters
while (totalRead < 8192 && (n = reader.read(buf, 0, buf.length)) != -1) {
content.append(buf, 0, n);
totalRead += n;
}
reader.close();
// extract the title
Matcher matcher = TITLE_TAG.matcher(content);
if (matcher.find()) {
/* replace any occurrences of whitespace (which may
* include line feeds and other uglies) as well
* as HTML brackets with a space */
return matcher.group(1).replaceAll("[\\s\\<>]+", " ").trim();
}
else
return null;
}
}
/**
* Loops through response headers until Content-Type is found.
* @param conn
* @return ContentType object representing the value of
* the Content-Type header
*/
private static ContentType getContentTypeHeader(URLConnection conn) {
int i = 0;
boolean moreHeaders = true;
do {
String headerName = conn.getHeaderFieldKey(i);
String headerValue = conn.getHeaderField(i);
if (headerName != null && headerName.equals("Content-Type"))
return new ContentType(headerValue);
i++;
moreHeaders = headerName != null || headerValue != null;
}
while (moreHeaders);
return null;
}
private static Charset getCharset(ContentType contentType) {
if (contentType != null && contentType.charsetName != null && Charset.isSupported(contentType.charsetName))
return Charset.forName(contentType.charsetName);
else
return null;
}
/**
* Class holds the content type and charset (if present)
*/
private static final class ContentType {
private static final Pattern CHARSET_HEADER = Pattern.compile("charset=([-_a-zA-Z0-9]+)", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);
private String contentType;
private String charsetName;
private ContentType(String headerValue) {
if (headerValue == null)
throw new IllegalArgumentException("ContentType must be constructed with a not-null headerValue");
int n = headerValue.indexOf(";");
if (n != -1) {
contentType = headerValue.substring(0, n);
Matcher matcher = CHARSET_HEADER.matcher(headerValue);
if (matcher.find())
charsetName = matcher.group(1);
}
else
contentType = headerValue;
}
}
}
Using this class is simple:
String title = TitleExtractor.getPageTitle("http://en.wikipedia.org/");
System.out.println(title);
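The title matching itself can be tried without a network connection. The following sketch applies a similar pattern to an in-memory HTML string; TitleRegexDemo and extractTitle are hypothetical names, and the pattern is a slight variation (it tolerates attributes on the title tag and uses a reluctant quantifier), not the exact one from the class above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TitleRegexDemo {
    // CASE_INSENSITIVE handles <TITLE>, DOTALL lets "." match line breaks,
    // and the reluctant (.*?) stops at the first closing tag.
    private static final Pattern TITLE_TAG =
            Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the collapsed title text, or null if no title tag is found.
    static String extractTitle(String html) {
        Matcher m = TITLE_TAG.matcher(html);
        if (!m.find()) {
            return null;
        }
        // Collapse runs of whitespace (including line feeds) into single spaces.
        return m.group(1).replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><TITLE>\n Yahoo\n India </TITLE></head><body></body></html>";
        System.out.println(extractTitle(html));
    }
}
```

A regular expression is enough for a quick look at the head of a page, but it is not an HTML parser, so unusual markup can still defeat it.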
Here is the link:
I hope it helps.