Question

I have been given a String containing any valid URL. I have to find the name of the website from the given URL, ignoring any subdomains.

For example:

http://www.yahoo.com   =>    yahoo
www.google.co.in =>      google
http://in.com    =>      in
http://india.gov.in/ => india
https://in.yahoo.com/ => yahoo
http://philotheoristic.tumblr.com/  =>tumblr
https://in.movies.yahoo.com/        =>yahoo

How can I do this?

Answer 1

A regular expression can help you:

 String str = "www.google.co.in";
 String [] res = str.split("(\\.|//)+(?=\\w)");
 System.out.println(res[1]);

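A regular expression is a way of representing a set of strings. The set consists of every string that matches the expression. In the code above, the string passed to split is a regular expression that matches any "." followed by an alphanumeric character, or any "//" followed by an alphanumeric character. These "." and "//" substrings are the delimiters used to split the string into parts, and the part that follows the first delimiter is taken as the site name.

In "www.google.co.in" the string is split this way: www, google, co, in. Since the solution takes the element at index 1 of the split array, the result is: google.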
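
To make the behaviour of the split concrete, here is a small sketch that applies the same regular expression to a few of the inputs from the question. The class name SplitDemo and the helper parts are made up for illustration; only the regular expression comes from the answer.

```java
import java.util.Arrays;

class SplitDemo {
    // Same regular expression as in the answer above: one or more "." or "//"
    // delimiters, but only when a word character follows.
    static String[] parts(String url) {
        return url.split("(\\.|//)+(?=\\w)");
    }

    public static void main(String[] args) {
        String[] samples = {
            "www.google.co.in",
            "http://in.com",
            "http://www.yahoo.com",
            "https://in.yahoo.com/"
        };
        for (String s : samples) {
            String[] p = parts(s);
            // Print every part plus the element the answer picks (index 1).
            System.out.println(s + " -> " + Arrays.toString(p) + ", res[1] = " + p[1]);
        }
    }
}
```

For the first two inputs index 1 holds google and in, as wanted. For "http://www.yahoo.com" the parts are http:, www, yahoo, com, so index 1 is www, and for "https://in.yahoo.com/" it is in. The fixed index therefore only fits inputs that have exactly one part in front of the name.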

Answer 2

You can use the URL class.

From the docs - http://docs.oracle.com/javase/tutorial/networking/urls/urlInfo.html

import java.net.*;
import java.io.*;

public class ParseURL {
    public static void main(String[] args) throws MalformedURLException {

        URL aURL = new URL("http://example.com:80/docs/books/tutorial"
                           + "/index.html?name=networking#DOWNLOADING");

        System.out.println("protocol = " + aURL.getProtocol());
        System.out.println("authority = " + aURL.getAuthority());
        System.out.println("host = " + aURL.getHost());
        System.out.println("port = " + aURL.getPort());
        System.out.println("path = " + aURL.getPath());
        System.out.println("query = " + aURL.getQuery());
        System.out.println("filename = " + aURL.getFile());
        System.out.println("ref = " + aURL.getRef());
    }
}

Here is the output displayed by the program:

protocol = http
authority = example.com:80
host = example.com                     // name of website
port = 80
path = /docs/books/tutorial/index.html
query = name=networking
filename = /docs/books/tutorial/index.html?name=networking
ref = DOWNLOADING

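So aURL.getHost() gives you the host name of the website. To ignore subdomains, you can split the host on ".". Because split takes a regular expression, the dot has to be escaped: aURL.getHost().split("\\.")[0] returns the first label of the host. That label is the site name only when the host has no prefix such as "www" in front of it.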
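
As a rough sketch of how the host could be reduced to the name the question asks for, the example below parses the host with java.net.URI and picks the label just before the suffix. The class HostNameSketch, the method siteName, and the tiny TWO_LABEL_SUFFIXES set are assumptions made for this illustration. A hard-coded set cannot cover every country-code suffix, so real code would more likely consult a full public suffix list.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

class HostNameSketch {
    // Hypothetical, deliberately tiny list of two-label suffixes (an assumption,
    // not a complete list).
    private static final Set<String> TWO_LABEL_SUFFIXES = Set.of("co.in", "gov.in", "co.uk");

    static String siteName(String input) {
        // Inputs such as "www.google.co.in" have no scheme; add one so URI can find the host.
        String withScheme = input.contains("://") ? input : "http://" + input;
        String host = URI.create(withScheme).getHost();
        if (host == null) {
            throw new IllegalArgumentException("No host found in: " + input);
        }
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        int n = labels.length;
        // Assume a one-label suffix ("com") unless the last two labels are a known pair.
        int suffixLabels = 1;
        if (n >= 3 && TWO_LABEL_SUFFIXES.contains(labels[n - 2] + "." + labels[n - 1])) {
            suffixLabels = 2;
        }
        int nameIndex = n - suffixLabels - 1;
        // Fall back to the only label for single-label hosts such as "localhost".
        return nameIndex >= 0 ? labels[nameIndex] : labels[0];
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.yahoo.com", "www.google.co.in", "http://in.com",
            "http://india.gov.in/", "https://in.movies.yahoo.com/"
        };
        for (String s : samples) {
            System.out.println(s + " => " + siteName(s));
        }
    }
}
```

With this suffix set, the sample inputs give yahoo, google, in, india and yahoo. A host whose suffix is missing from the set (for example one ending in "com.au") would return the wrong label, which is the main limitation of the approach.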

Answer 3

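There is no possible way to find a valid website name from a URL alone. However, if you are trying to cut out a specific part of the URL string, you can do that with string manipulation, as follows: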

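if (url.endsWith(".co.in")) {
    // index of the dot that starts the ".co.in" suffix
    int suffixStart = url.length() - ".co.in".length();
    // index just after the previous dot (0 if there is none)
    int nameStart = url.lastIndexOf('.', suffixStart - 1) + 1;
    website = url.substring(nameStart, suffixStart);
}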

Answer 4

I found something similar, although it is a little different.

http://www.yahoo.com   =>    Yahoo
http://www.google.co.in =>      Google
http://in.com    => In.com Offers Videos, News, Photos, Celebs, Live TV Channels.....
http://india.gov.in/ => National Portal of India
https://in.yahoo.com/ => Yahoo India
http://philotheoristic.tumblr.com/  => Philotheoristic
https://in.movies.yahoo.com/ => Yahoo India Movies - Bollywood News, Movie Reviews & Hindi Movie Videos

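Here is the code:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {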
/* the CASE_INSENSITIVE flag accounts for
 * sites that use uppercase title tags.
 * the DOTALL flag accounts for sites that have
 * line feeds in the title text */
private static final Pattern TITLE_TAG =
    // reluctant (.*?) so the match stops at the first closing title tag
    Pattern.compile("\\<title>(.*?)\\</title>", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);

/**
 * @param url the HTML page
 * @return title text (null if document isn't HTML or lacks a title tag)
 * @throws IOException
 */
public static String getPageTitle(String url) throws IOException {
    URL u = new URL(url);
    URLConnection conn = u.openConnection();

    // ContentType is an inner class defined below
    ContentType contentType = getContentTypeHeader(conn);
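    // getContentTypeHeader returns null when the response has no Content-Type header
    if (contentType == null || !contentType.contentType.trim().equalsIgnoreCase("text/html"))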
        return null; // don't continue if not HTML
    else {
        // determine the charset, or use the default
        Charset charset = getCharset(contentType);
        if (charset == null)
            charset = Charset.defaultCharset();

        // read the response body, using BufferedReader for performance
        InputStream in = conn.getInputStream();
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
        int n = 0, totalRead = 0;
        char[] buf = new char[1024];
        StringBuilder content = new StringBuilder();

        // read until EOF or first 8192 characters
        while (totalRead < 8192 && (n = reader.read(buf, 0, buf.length)) != -1) {
            content.append(buf, 0, n);
            totalRead += n;
        }
        reader.close();

        // extract the title
        Matcher matcher = TITLE_TAG.matcher(content);
        if (matcher.find()) {
            /* replace any occurrences of whitespace (which may
             * include line feeds and other uglies) as well
             * as HTML brackets with a space */
            return matcher.group(1).replaceAll("[\\s\\<>]+", " ").trim();
        }
        else
            return null;
    }
}

/**
 * Loops through response headers until Content-Type is found.
 * @param conn
 * @return ContentType object representing the value of
 * the Content-Type header
 */
private static ContentType getContentTypeHeader(URLConnection conn) {
    int i = 0;
    boolean moreHeaders = true;
    do {
        String headerName = conn.getHeaderFieldKey(i);
        String headerValue = conn.getHeaderField(i);
        // HTTP header names are case-insensitive
        if (headerName != null && headerName.equalsIgnoreCase("Content-Type"))
            return new ContentType(headerValue);

        i++;
        moreHeaders = headerName != null || headerValue != null;
    }
    while (moreHeaders);

    return null;
}

private static Charset getCharset(ContentType contentType) {
    if (contentType != null && contentType.charsetName != null && Charset.isSupported(contentType.charsetName))
        return Charset.forName(contentType.charsetName);
    else
        return null;
}

/**
 * Class holds the content type and charset (if present)
 */
private static final class ContentType {
    private static final Pattern CHARSET_HEADER = Pattern.compile("charset=([-_a-zA-Z0-9]+)", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);

    private String contentType;
    private String charsetName;
    private ContentType(String headerValue) {
        if (headerValue == null)
            throw new IllegalArgumentException("ContentType must be constructed with a not-null headerValue");
        int n = headerValue.indexOf(";");
        if (n != -1) {
            contentType = headerValue.substring(0, n);
            Matcher matcher = CHARSET_HEADER.matcher(headerValue);
            if (matcher.find())
                charsetName = matcher.group(1);
        }
        else
            contentType = headerValue;
    }
}
}

Using this class is simple:

 String title = TitleExtractor.getPageTitle("http://en.wikipedia.org/");
 System.out.println(title);
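
To see what the title pattern and the whitespace cleanup do without making a network request, the following sketch applies a pattern modelled on TITLE_TAG to an in-memory HTML string. The class TitleRegexDemo, the method extractTitle, and the sample HTML are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleRegexDemo {
    // Case-insensitive, and DOTALL so the title text may span several lines.
    // The reluctant (.*?) stops at the first closing title tag.
    private static final Pattern TITLE_TAG =
        Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the cleaned title text, or null if no title element is found.
    static String extractTitle(String html) {
        Matcher matcher = TITLE_TAG.matcher(html);
        if (!matcher.find()) {
            return null;
        }
        // Collapse runs of whitespace and stray angle brackets into single spaces.
        return matcher.group(1).replaceAll("[\\s<>]+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><TITLE>\n  Yahoo\n  India </TITLE></head><body></body></html>";
        System.out.println(extractTitle(html)); // prints "Yahoo India"
    }
}
```

Note that this returns the page title rather than a name derived from the URL, and a title tag that carries attributes would not match this simple pattern.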

Here is the link:

http://www.gotoquiz.com/web-coding/programming/java-programming/how-to-extract-titles-from-web-pages-in-java/

I hope it helps you.

How to get the website name from any String URL
