Given an HTML string (possibly malformed), how can I find the title? This seems simple, but I am having a hard time doing it.
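Update: As requested, here are some URLs whose HTML Jsoup seems unable to find a title in. I collected their HTML a month ago, so some of the pages may have changed since.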
http://www.miamitodaynews.com/news/050113/crossword.shtml ()
http://www.miamitodaynews.com/news/081218/cal-highlights.shtml/feed/ ()
http://www.miashoes.com/mia-limited-edition/flats.html?refineclr=2125%2C2136 ()
http://www.mica.edu/News/Workshop_on_111809_Archive_and_Inventory_Your_Image_Collections.html ()
http://www.michaelgeist.ca/2011/10/daily-digital-lock-15/ ()
http://www.michaelkors.com/bags/_/N-283g?cmCat=cat000000cat144cat44301cat44302&index=9&isEditorial=false ()
http://www.michaelkors.com/watches/_/N-28c2?cmCat=cat000000cat145cat35701cat30001&index=39&isEditorial=false ()
http://www.michaelkors.com/watches/_/N-28c2?cmCat=cat000000cat145cat7502&index=92&isEditorial=false ()
http://www.michaelmillerfabrics.com/catalog/seo_sitemap/product/?p=2 ()
http://www.michaels.com/10104250.html ()
http://www.menseffects.com/PROMETHEUS-2-Switchblade-Automatic-Knife-p/att00176a.htm (http://www.menseffects.com/PROMETHEUS-2-Switchblade-Automatic-Knife-p/att00176a.htm)
http://www.menstennisforums.com/misc.php?do=whoposted&t=16764 (http://www.menstennisforums.com/misc.php?do=whoposted&t=16764)
http://www.menstennisforums.com/showpost.php?p=12242018&postcount=115 (http://www.menstennisforums.com/showpost.php?p=12242018&postcount=115)
http://www.menstennisforums.com/showpost.php?p=12623891&postcount=13 (http://www.menstennisforums.com/showpost.php?p=12623891&postcount=13)
http://www.menstennisforums.com/showpost.php?p=13010289&postcount=5476 (http://www.menstennisforums.com/showpost.php?p=13010289&postcount=5476)
http://www.menstylepower.com/category/blog/page/14/ ()
http://www.menstylepower.com/tag/mens-loafers/ ()
http://www.memorysuppliers.com/product-tag/usb-drive/?filter_color=46%2C45&filter_double-sided-imprint=295 ()
http://www.memorysuppliers.com/usb-flash-drives/?filter_imprint-area=306&filter_material=291&filter_price=305 ()
http://www.memorysuppliers.com/usb-flash-drives/best-sellers/?filter_color=51%2C27&filter_material=290&filter_price=302 ()
http://www.memorysuppliers.com/usb-flash-drives/best-sellers/?filter_color=51&filter_imprint-area=306&filter_speed=296 ()
http://www.memorysuppliers.com/usb-flash-drives/capless/?filter_color=51%2C47&filter_double-sided-imprint=294&filter_speed=296 ()
http://www.memphisdailynews.com/Search/Search.aspx?fn=Cathy&ln=Rogers&redir=1 ()
http://www.memphisdailynews.com/Search/Search.aspx?redir=1&sno=931%20Frayser%20Blvd ()
http://www.memphisdailynews.com/Search/Search.aspx?redir=1&sno=314%2BS.%2BMain%2BSt ()
http://www.memphisdailynews.com/news/2012/dec/27/starbucks-cups-to-come-with-a-political-message/ ()
http://www.memphisdailynews.com/news/2014/mar/24/tigers-season-ends-on-common-theme-underachieved/ ()
http://www.memphismagazine.com/December-2006/Blade-Runner/ ()
Answer 0 (score: 1)
This is quick and easy with the excellent jsoup library; see its documentation.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SoGetTitleFromString {
    public static void main(String[] args) {
        String html = "<html><head><title>First parse</title></head>"
                + "<body><p>Parsed HTML into a doc.</p></body></html>";
        // Parse the in-memory string; no network access is involved
        Document doc = Jsoup.parse(html);
        // title() returns the text of the document's title element
        String title = doc.title();
        System.out.println("Title is: " + title);
    }
}
Output:
Title is: First parse
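Edit: OK, what you are trying to do is get a list of titles from a string of URLs. The string you are parsing is a list of URLs, not the HTML itself. Try this: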
import java.io.IOException;
import java.util.Scanner;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SoGetTitlesFromListOfUrls {
    public static void main(String[] args) throws IOException {
        String inUrls = "http://www.miamitodaynews.com/news/050113/crossword.shtml ()\n"
                + "http://www.miamitodaynews.com/news/081218/cal-highlights.shtml/feed/ ()\n"
                + "http://www.miashoes.com/mia-limited-edition/flats.html?refineclr=2125%2C2136 ()\n"
                + "http://www.mica.edu/News/Workshop_on_111809_Archive_and_Inventory_Your_Image_Collections.html ()\n"
                + "http://www.michaelgeist.ca/2011/10/daily-digital-lock-15/ ()\n"
                + "http://www.michaelkors.com/bags/_/N-283g?cmCat=cat000000cat144cat44301cat44302&index=9&isEditorial=false ()\n"
                + "http://www.michaelkors.com/watches/_/N-28c2?cmCat=cat000000cat145cat35701cat30001&index=39&isEditorial=false ()\n"
                + "http://www.michaelkors.com/watches/_/N-28c2?cmCat=cat000000cat145cat7502&index=92&isEditorial=false ()\n"
                + "http://www.michaelmillerfabrics.com/catalog/seo_sitemap/product/?p=2 ()\n"
                + "http://www.michaels.com/10104250.html ()\n";
        // try-with-resources closes the Scanner once every line has been read
        try (Scanner urlScanner = new Scanner(inUrls)) {
            while (urlScanner.hasNextLine()) {
                // Get the first token from the line, space delimited
                String url = urlScanner.nextLine().split(" ")[0];
                // Fetch the page over HTTP and parse the response
                Document doc = Jsoup.connect(url).get();
                String title = doc.title();
                System.out.println("Title is: " + title);
            }
        }
    }
}
Output:
Title is: Miami Today Crossword Answers - Miami Today
Title is: Comments on: Calendar Of Events Highlights
Title is: MIA LIMITED EDITION FLATS - WOMEN FLATS
Title is: Workshop on 11.18.09: Archive & Inventory Your Image Collections | MICA
Title is: The Daily Digital Lock Dissenter, Day 15: Canadian Bookseller Association - Michael Geist
Title is: Handbags - Crossbody to Clutches to Totes & More | Michael Kors
Title is: Watches by Michael Kors - Womens & Mens Luxury, Chic & Timeless Styles
Title is: Watches by Michael Kors - Womens & Mens Luxury, Chic & Timeless Styles
Title is: Site Map
Title is: Creatology™ 3D Foam Kit, Pirate Ship
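The line-splitting step in the code above can be tried on its own, without jsoup or any network access. The following is only a sketch under assumptions: the class name UrlLineSketch and the method parseUrlLines are hypothetical, and it assumes each line starts with a URL followed by whitespace and a parenthesized remainder, as in the list from the question. Blank lines and lines whose first token is not an absolute URI are skipped rather than passed on to the fetch step.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of jsoup: isolates the "first token per line" step.
class UrlLineSketch {

    /** Returns the leading URL of every line that starts with an absolute URI. */
    static List<URI> parseUrlLines(String text) {
        List<URI> urls = new ArrayList<>();
        // \R matches any line break (\n, \r\n, ...)
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            // First whitespace-delimited token; the trailing "()" part is ignored
            String token = trimmed.split("\\s+")[0];
            try {
                URI uri = new URI(token);
                if (uri.isAbsolute()) { // requires a scheme such as http
                    urls.add(uri);
                }
            } catch (URISyntaxException e) {
                // Not a syntactically valid URI: skip the line instead of failing
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String inUrls = "http://www.michaelgeist.ca/2011/10/daily-digital-lock-15/ ()\n"
                + "\n"
                + "http://www.michaels.com/10104250.html ()\n";
        for (URI uri : parseUrlLines(inUrls)) {
            System.out.println(uri);
        }
    }
}
```

Each returned URI could then be handed to Jsoup.connect(uri.toString()).get() as in the answer above. Catching IOException per URL, rather than letting the first failure end the loop, may be worth considering for a long list.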
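Answer 1 (score: 0)
Use an HTML parser for Java (for example HTMLParser), or use a regular expression to extract the title from the malformed HTML string, perhaps something like <title>(.*?)</title>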
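The regular-expression route can be sketched against an in-memory string, which is the input the question actually describes. This is a minimal sketch, not a definitive implementation: the class name TitleRegexSketch, the method extractTitle, and the exact pattern are assumptions made for illustration. Compared with a bare <title>(.*?)</title>, the pattern also accepts upper-case tags, attributes on the opening tag, and titles that span several lines.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: regex-based title extraction from a possibly malformed HTML string.
class TitleRegexSketch {

    // CASE_INSENSITIVE accepts <TITLE>; DOTALL lets '.' match line breaks
    // so a title that spans several lines is still captured.
    private static final Pattern TITLE = Pattern.compile(
            "<title\\b[^>]*>(.*?)</title\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the first title with whitespace collapsed, or empty if none is found. */
    static Optional<String> extractTitle(String html) {
        if (html == null) {
            return Optional.empty();
        }
        Matcher m = TITLE.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        // Collapse runs of whitespace (including newlines) into single spaces
        String title = m.group(1).replaceAll("\\s+", " ").trim();
        return Optional.of(title);
    }

    public static void main(String[] args) {
        // Malformed on purpose: no </head>, unclosed <p>, no </body> or </html>
        String malformed = "<HTML><head><TITLE id=\"t\">First\n   parse</TITLE>"
                + "<body><p>Unclosed paragraph";
        System.out.println("Title is: " + extractTitle(malformed).orElse("(none)"));
        System.out.println("Title is: " + extractTitle("<p>no title here").orElse("(none)"));
    }
}
```

The sketch needs a closing </title> tag to match, does not decode entities such as &amp;, and can be fooled by a title element inside a comment, a script, or inline SVG. A real HTML parser avoids those cases, which is one reason to prefer it when the input is messy.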
Answer 2 (score: 0)
The simplest way is to use a regular expression. This is taken from java2s.com.
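import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Scratch {
    public static void main(String[] argv) throws Exception {
        URL url = new URL("http://www.java.com/");
        URLConnection urlConnection = url.openConnection();
        StringBuilder html = new StringBuilder();
        // Read the response as text, line by line. DataInputStream.readUTF()
        // expects Java's length-prefixed modified UTF-8 and cannot read a plain
        // HTML stream. UTF-8 is assumed here; the page may declare another charset.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(urlConnection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(' ').append(line);
            }
        }
        // Collapse whitespace so a title split across lines still matches
        String normalized = html.toString().replaceAll("\\s+", " ");
        Pattern p = Pattern.compile("<title>(.*?)</title>");
        Matcher m = p.matcher(normalized);
        // Print every title element found in the page
        while (m.find()) {
            System.out.println(m.group(1));
        }
    }
}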