Java - How to find the title of an HTML string

Time: 2014-12-10 00:32:24

Tags: java html html-parsing

Given an HTML string (possibly malformed), how do I find the title? This seems simple, but I am having a hard time doing it.


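Update: As requested, here are some URLs whose HTML Jsoup seems unable to find a title in. I collected their HTML a month ago, so some of the pages may have changed since then.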

http://www.miamitodaynews.com/news/050113/crossword.shtml ()
http://www.miamitodaynews.com/news/081218/cal-highlights.shtml/feed/ ()
http://www.miashoes.com/mia-limited-edition/flats.html?refineclr=2125%2C2136 ()
http://www.mica.edu/News/Workshop_on_111809_Archive_and_Inventory_Your_Image_Collections.html ()
http://www.michaelgeist.ca/2011/10/daily-digital-lock-15/ ()
http://www.michaelkors.com/bags/_/N-283g?cmCat=cat000000cat144cat44301cat44302&index=9&isEditorial=false ()
http://www.michaelkors.com/watches/_/N-28c2?cmCat=cat000000cat145cat35701cat30001&index=39&isEditorial=false ()
http://www.michaelkors.com/watches/_/N-28c2?cmCat=cat000000cat145cat7502&index=92&isEditorial=false ()
http://www.michaelmillerfabrics.com/catalog/seo_sitemap/product/?p=2 ()
http://www.michaels.com/10104250.html ()


http://www.menseffects.com/PROMETHEUS-2-Switchblade-Automatic-Knife-p/att00176a.htm (http://www.menseffects.com/PROMETHEUS-2-Switchblade-Automatic-Knife-p/att00176a.htm)
http://www.menstennisforums.com/misc.php?do=whoposted&t=16764 (http://www.menstennisforums.com/misc.php?do=whoposted&t=16764)
http://www.menstennisforums.com/showpost.php?p=12242018&postcount=115 (http://www.menstennisforums.com/showpost.php?p=12242018&postcount=115)
http://www.menstennisforums.com/showpost.php?p=12623891&postcount=13 (http://www.menstennisforums.com/showpost.php?p=12623891&postcount=13)
http://www.menstennisforums.com/showpost.php?p=13010289&postcount=5476 (http://www.menstennisforums.com/showpost.php?p=13010289&postcount=5476)
http://www.menstylepower.com/category/blog/page/14/ ()
http://www.menstylepower.com/tag/mens-loafers/ ()
http://www.memorysuppliers.com/product-tag/usb-drive/?filter_color=46%2C45&filter_double-sided-imprint=295 ()
http://www.memorysuppliers.com/usb-flash-drives/?filter_imprint-area=306&filter_material=291&filter_price=305 ()
http://www.memorysuppliers.com/usb-flash-drives/best-sellers/?filter_color=51%2C27&filter_material=290&filter_price=302 ()
http://www.memorysuppliers.com/usb-flash-drives/best-sellers/?filter_color=51&filter_imprint-area=306&filter_speed=296 ()
http://www.memorysuppliers.com/usb-flash-drives/capless/?filter_color=51%2C47&filter_double-sided-imprint=294&filter_speed=296 ()
http://www.memphisdailynews.com/Search/Search.aspx?fn=Cathy&ln=Rogers&redir=1 ()
http://www.memphisdailynews.com/Search/Search.aspx?redir=1&sno=931%20Frayser%20Blvd ()
http://www.memphisdailynews.com/Search/Search.aspx?redir=1&sno=314%2BS.%2BMain%2BSt ()
http://www.memphisdailynews.com/news/2012/dec/27/starbucks-cups-to-come-with-a-political-message/ ()
http://www.memphisdailynews.com/news/2014/mar/24/tigers-season-ends-on-common-theme-underachieved/ ()
http://www.memphismagazine.com/December-2006/Blade-Runner/ ()

3 answers:

Answer 0: (score: 1)

This is easy with the excellent jsoup library. Take a look:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SoGetTitleFromString {

    public static void main(String[] args) throws IOException {

        String html = "<html><head><title>First parse</title></head>"
                  + "<body><p>Parsed HTML into a doc.</p></body></html>";
        Document doc = Jsoup.parse(html);
        String title = doc.title();
        System.out.println("Title is: " + title);
    }
}

Output:

Title is: First parse

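Edit: OK, what you are trying to do is get a list of titles from a string of URLs. The string you are parsing is a list of URLs, not the HTML itself. Try this: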

import java.io.IOException;
import java.util.Scanner;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;


public class SoGetTitlesFromListOfUrls {

    public static void main(String[] args) throws IOException {

        String inUrls = "http://www.miamitodaynews.com/news/050113/crossword.shtml ()\n"
                + "http://www.miamitodaynews.com/news/081218/cal-highlights.shtml/feed/ ()\n"
                + "http://www.miashoes.com/mia-limited-edition/flats.html?refineclr=2125%2C2136 ()\n"
                + "http://www.mica.edu/News/Workshop_on_111809_Archive_and_Inventory_Your_Image_Collections.html ()\n"
                + "http://www.michaelgeist.ca/2011/10/daily-digital-lock-15/ ()\n"
                + "http://www.michaelkors.com/bags/_/N-283g?cmCat=cat000000cat144cat44301cat44302&index=9&isEditorial=false ()\n"
                + "http://www.michaelkors.com/watches/_/N-28c2?cmCat=cat000000cat145cat35701cat30001&index=39&isEditorial=false ()\n"
                + "http://www.michaelkors.com/watches/_/N-28c2?cmCat=cat000000cat145cat7502&index=92&isEditorial=false ()\n"
                + "http://www.michaelmillerfabrics.com/catalog/seo_sitemap/product/?p=2 ()\n"
                + "http://www.michaels.com/10104250.html ()\n";

        // try-with-resources closes the Scanner once every line has been read
        try (Scanner urlScanner = new Scanner(inUrls)) {
            while (urlScanner.hasNextLine()) {
                String url = urlScanner.nextLine().split(" ")[0]; // Get the first token from the line, space delimited
                Document doc = Jsoup.connect(url).get(); // Fetch the page and parse it
                String title = doc.title();
                System.out.println("Title is: " + title);
            }
        }
    }
}

Output:

Title is: Miami Today Crossword Answers - Miami Today
Title is: Comments on: Calendar Of Events Highlights
Title is: MIA LIMITED EDITION FLATS - WOMEN FLATS
Title is: Workshop on 11.18.09: Archive & Inventory Your Image Collections | MICA
Title is: The Daily Digital Lock Dissenter, Day 15: Canadian Bookseller Association - Michael Geist
Title is: Handbags - Crossbody to Clutches to Totes & More | Michael Kors
Title is: Watches by Michael Kors - Womens & Mens Luxury, Chic & Timeless Styles
Title is: Watches by Michael Kors - Womens & Mens Luxury, Chic & Timeless Styles
Title is: Site Map
Title is: Creatology™ 3D Foam Kit, Pirate Ship
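
One detail worth noting about the URL list in the question: it contains blank separator lines, and `"".split(" ")[0]` yields an empty string, which the loop above would then hand to `Jsoup.connect`. A minimal sketch of a more tolerant line parser follows, using only the standard library. The class name `UrlListSketch` and the method `parseUrls` are made up for illustration and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;

public class UrlListSketch {

    /** Returns the first whitespace-delimited token of every non-blank line. */
    public static List<String> parseUrls(String lines) {
        List<String> urls = new ArrayList<>();
        for (String line : lines.split("\\R")) { // \R matches any line break
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank separator lines
            }
            urls.add(trimmed.split("\\s+")[0]); // drop the trailing "(...)" part
        }
        return urls;
    }

    public static void main(String[] args) {
        String inUrls = "http://www.michaels.com/10104250.html ()\n"
                + "\n"
                + "http://www.menstylepower.com/tag/mens-loafers/ ()\n";
        for (String url : parseUrls(inUrls)) {
            System.out.println(url);
        }
    }
}
```

The returned list can then be fed to the fetch loop one URL at a time, ideally with a try/catch around each fetch so that one unreachable site does not abort the whole run.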

Answer 1: (score: 0)

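Use an HTML parser for Java (for example HTMLParser), or use a regular expression to extract the title from the malformed HTML string, perhaps something like <title>(.*?)</title>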
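
As a rough illustration of the regular-expression route, here is a minimal sketch that works on an HTML string directly and needs only the standard library. The class name `TitleRegexSketch` and the method `extractTitle` are hypothetical, and the pattern is one reasonable choice rather than a complete HTML grammar: it ignores case, lets the title span several lines, and tolerates attributes on the opening tag.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleRegexSketch {

    // CASE_INSENSITIVE so <TITLE> matches; DOTALL so the title may span lines;
    // [^>]* tolerates attributes on the opening tag.
    private static final Pattern TITLE = Pattern.compile(
            "<title\\b[^>]*>(.*?)</title\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the first title found, with whitespace collapsed, or empty if none. */
    public static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1).replaceAll("\\s+", " ").trim());
    }

    public static void main(String[] args) {
        // Malformed input: no <html> or <head>, upper-case tag, title split across lines
        String html = "<TITLE id=\"t\">First\n   parse</TITLE><p>Unclosed paragraph";
        System.out.println(extractTitle(html).orElse("(no title found)"));
    }
}
```

This sketch does not decode entities such as `&amp;`, and it returns nothing when the closing tag is missing, so a real parser remains the safer choice for badly broken pages.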

Answer 2: (score: 0)

The easiest way is to use a regular expression. This is adapted from java2s.com.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Scratch {
  public static void main(String[] argv) throws Exception {

    URL url = new URL("http://www.java.com/");
    URLConnection urlConnection = url.openConnection();
    StringBuilder sb = new StringBuilder();
    // Read the response as text. DataInputStream.readUTF() only decodes data
    // written by DataOutputStream.writeUTF(), so it cannot read a web page.
    // UTF-8 is assumed here; the page's declared charset may differ.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(urlConnection.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        sb.append(' ').append(line);
      }
    }

    // Collapse whitespace so a title split across lines still matches
    String html = sb.toString().replaceAll("\\s+", " ");
    Pattern p = Pattern.compile("<title>(.*?)</title>");
    Matcher m = p.matcher(html);
    while (m.find()) {
      System.out.println(m.group(1));
    }
  }
}