Question

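How can I get the title of a web page for a given URL using an HTML parser? Could the title be obtained with a regular expression instead? I would prefer to use an HTML parser.

I am working in Java with the Eclipse IDE.

I tried the following code, but it did not work.

Any ideas?

Thanks in advance!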

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.tags.TitleTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class TestHtml {

    public static void main(String... args) {
        Parser parser = new Parser();
        try {
            parser.setResource("http://www.yahoo.com/");
            NodeList list = parser.parse(null);
            Node node = list.elementAt(0);

            if (node instanceof TitleTag) {
                TitleTag title = (TitleTag) node;
                System.out.println(title.getText());
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}

Answer 1

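Based on your (revised) question, the problem is that you only check the first node, Node node = list.elementAt(0);, whereas you should iterate over the list to find the title (it is not the first node). You could also pass a NodeFilter to parse() so that it returns only the TitleTag; the title would then be the first element and you would not need to iterate.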

Answer 2

Well, assuming you are using Java (though most languages have an equivalent), you can use a SAX parser, such as TagSoup, which converts any HTML into XHTML, and do this in the handler:

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;

public class MyHandler extends org.xml.sax.helpers.DefaultHandler {
    boolean readTitle = false;                  // true while inside <title>
    StringBuilder title = new StringBuilder();  // accumulated title text

    @Override
    public void startElement(String uri, String localName, String name,
                Attributes attributes) throws SAXException {
        if (localName.equals("title")) {
            readTitle = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String name)
            throws SAXException {
        if (localName.equals("title")) {
            readTitle = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        // May be called several times for one text node, so append.
        if (readTitle) title.append(ch, start, length);
    }
}

Then use it with a parser (example with TagSoup):

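// Requires: import org.xml.sax.InputSource;
org.ccil.cowan.tagsoup.Parser parser = new org.ccil.cowan.tagsoup.Parser();
MyHandler handler = new MyHandler();
parser.setContentHandler(handler);
parser.parse(new InputSource(inputStream)); // inputStream: an InputStream over your HTML
return handler.title.toString();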
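
For anyone who wants to try the handler idea without adding a dependency, here is a minimal, self-contained sketch that runs the same kind of handler through the JDK's built-in SAX parser. It is only an illustration under some assumptions: the class and method names (SaxTitleDemo, TitleHandler, extractTitle) are made up for this example, and the JDK parser stands in for TagSoup, so the input must already be well-formed XHTML. Because the JDK parser is not namespace-aware by default, the sketch checks qName rather than localName.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the names are illustrative, not from any library.
final class SaxTitleDemo {

    /** Collects the character data found between <title> and </title>. */
    static final class TitleHandler extends DefaultHandler {
        private final StringBuilder title = new StringBuilder();
        private boolean inTitle = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            // The default JDK parser is not namespace-aware, so qName carries the tag name.
            if ("title".equalsIgnoreCase(qName)) {
                inTitle = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equalsIgnoreCase(qName)) {
                inTitle = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // characters() may fire several times for one text node, so append each chunk.
            if (inTitle) {
                title.append(ch, start, length);
            }
        }

        String getTitle() {
            return title.toString().trim();
        }
    }

    /** Parses well-formed XHTML and returns the title text ("" if there is no title). */
    static String extractTitle(String xhtml) {
        TitleHandler handler = new TitleHandler();
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
        return handler.getTitle();
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Hello World</title></head>"
                + "<body><p>Some text</p></body></html>";
        System.out.println(extractTitle(page));
    }
}
```

For real-world pages, which are rarely well-formed, the TagSoup parser shown above would be the one to plug the handler into.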

Answer 3

By the way, HTMLParser ships with a very simple title extractor. See: http://htmlparser.sourceforge.net/samples.html

To run it (from the HtmlParser code base), run:

bin/parser http://website_url TITLE

or run

java -jar <path to htmlparser.jar> http://website_url TITLE

or call this method from your code

org.htmlparser.Parser.main(String[] args)

with the arguments new String[] {"<website url>", "TITLE"}

Answer 4

RegEx match open tags except XHTML self-contained tags

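Wise choice: you don't want to use a regex.

To use an HTML parser, we need to know which language you are using. Since you say you are "in Eclipse", I'll assume Java.

Take a look at http://www.ibm.com/developerworks/xml/library/x-domjava/ for explanations, an overview, and various points of view.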
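
To make the DOM route a little more concrete, here is a minimal sketch of a DOM-based title lookup that uses only the XML parser bundled with the JDK. Treat it as an illustration under assumptions rather than a recommended implementation: DomTitleDemo and extractTitle are names invented for this example, the input must be well-formed XHTML (the JDK parser does not repair sloppy HTML), and the tag-name match is case-sensitive.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; the names are illustrative, not from any library.
final class DomTitleDemo {

    /** Returns the text of the first <title> element, or null if there is none. */
    static String extractTitle(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            // Searches the whole tree, so the title need not be the first node.
            NodeList titles = doc.getElementsByTagName("title");
            if (titles.getLength() == 0) {
                return null;
            }
            return titles.item(0).getTextContent().trim();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Hello World</title></head>"
                + "<body><p>Some text</p></body></html>";
        System.out.println(extractTitle(page));
    }
}
```

The trade-off against a SAX handler is that DOM builds the whole document in memory first, which is simpler to query but heavier for large pages.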

Answer 5

This is very easy with HtmlAgilityPack (a .NET library); you only need the HTTP response as a string.

string response = httpRequest.getResponseString(); // placeholder: obtain the page HTML however your HTTP client exposes it
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(response);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//title"); // first <title> element anywhere in the document
string title = node.InnerText; // the title of the page

For example, given <title>helloWorld</title>, node.InnerText contains helloWorld.

OR

string response = httpRequest.getResponseString(); // placeholder: obtain the page HTML however your HTTP client exposes it
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(response);

HtmlNode head = doc.DocumentNode.SelectSingleNode("//head"); // get <head>, which occurs once per document
HtmlNode node = head.SelectSingleNode("title"); // then get <title> from among head's children
string title = node.InnerText; // the title of the page
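
Since the question is about Java, here is a rough Java analogue of the //title lookup, sketched with the javax.xml.xpath API that ships with the JDK instead of HtmlAgilityPack. The names XPathTitleDemo and extractTitle are invented for this example, and the sketch assumes the input is well-formed XHTML without an XML namespace declaration; messy real-world HTML would first need to go through a tolerant parser.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; the names are illustrative, not from any library.
final class XPathTitleDemo {

    /** Returns the text of the first node matching //title, or "" if nothing matches. */
    static String extractTitle(String xhtml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // evaluate(String, InputSource) parses the input and returns the
            // string value of the first matching node.
            return xpath.evaluate("//title", new InputSource(new StringReader(xhtml))).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>helloWorld</title></head>"
                + "<body><p>Some text</p></body></html>";
        System.out.println(extractTitle(page));
    }
}
```

Swapping the expression for /html/head/title would restrict the match to the title inside head, mirroring the second variant above.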
