Question

I am using Java to fetch the title text from a web page.

I fetch the images from a web page by tag name, as follows:

    int i = 1;
    // The URL needs a protocol; new URL("www.yahoo.com") throws MalformedURLException
    InputStream in = new URL("http://www.yahoo.com").openStream();
    org.w3c.dom.Document doc = new Tidy().parseDOM(in, null);
    NodeList img = doc.getElementsByTagName("img");
    ArrayList<String> list = new ArrayList<String>();
    list.add(img.item(i).getAttributes().getNamedItem("src").getNodeValue());

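This works, but I want to use the same code to get the title tag from the web page (www.yahoo.com). I tried getElementsByTagName("title"), but it does not work. Please help me do this with the JTidy parser used above.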

Answer 1

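Note that NodeList indices start at 0 (I see your "int i = 1;"); see http://download.oracle.com/javase/1.4.2/docs/api/org/w3c/dom/NodeList.html

Also, "getNodeValue()" works on an attribute (i.e. "src") but not on an element, where it returns null; see http://download.oracle.com/javase/1.5.0/docs/api/org/w3c/dom/Node.html. In this case you can use "getTextContent()", since I don't believe the "title" tag has child elements. So: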

    String titleText = doc.getElementsByTagName("title").item(0).getTextContent();

Or:

    String titleText = doc.getElementsByTagName("title").item(0).getFirstChild().getNodeValue();
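
To make the two points above concrete, here is a minimal sketch. It is not JTidy code: it assumes a well-formed XHTML string and uses the JDK's built-in XML parser as a stand-in for Tidy.parseDOM, since both hand back an org.w3c.dom.Document. The class name TitleFromDom and its helper methods are made up for illustration. The getFirstChild().getNodeValue() form relies only on DOM Level 1 calls, which may matter if the parser's DOM implementation does not support getTextContent().

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class TitleFromDom {

    // Parses well-formed markup with the JDK's XML parser
    // (an assumed stand-in for new Tidy().parseDOM(in, null)).
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse markup", e);
        }
    }

    // Returns the text of the first <title> element, or null if there is none.
    static String titleOf(Document doc) {
        NodeList titles = doc.getElementsByTagName("title");
        if (titles.getLength() == 0) {
            return null;
        }
        Node title = titles.item(0);       // NodeList indices start at 0
        Node text = title.getFirstChild(); // the text node inside <title>
        return text == null ? "" : text.getNodeValue();
    }

    public static void main(String[] args) {
        Document doc = parse("<html><head><title>Yahoo</title></head><body/></html>");
        Node title = doc.getElementsByTagName("title").item(0);
        System.out.println(title.getNodeValue());   // null: elements have no node value
        System.out.println(title.getTextContent()); // text of the element (DOM Level 3)
        System.out.println(titleOf(doc));           // same text via the first child node
    }
}
```

The length check in titleOf avoids a NullPointerException when the page has no title element, because item(0) returns null on an empty NodeList.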

Answer 2

You can easily get the title of an HTML page with XPath:

    /html/head/title/text()

You can do this easily with Dom4J, and I think the same is possible with JTidy.
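
As a rough sketch of how that expression could be evaluated, the example below uses the JDK's own javax.xml.xpath API rather than Dom4J. It assumes a well-formed XHTML string with no namespace declaration, parsed by the JDK's XML parser; a Document produced by JTidy should be usable the same way, but that is untested here. The class name TitleViaXPath and the method titleOf are illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class TitleViaXPath {

    // Evaluates /html/head/title/text() against the parsed markup.
    // Returns an empty string when the expression matches nothing.
    static String titleOf(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate("/html/head/title/text()", doc);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titleOf("<html><head><title>Yahoo</title></head><body/></html>"));
    }
}
```

If the page declares the XHTML namespace and the parser is namespace-aware, the unprefixed path would no longer match and a NamespaceContext would be needed.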

Answer 3

Unless you post the code you actually used to try to get the title, we can't tell, but this obviously won't work:

    list.add(img.item(i).getAttributes().getNamedItem("src").getNodeValue());

because the title element has no src attribute.

Answer 4

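Try this:

    // Requires java.io.IOException, java.io.InputStream, java.net.URL, java.util.Scanner
    InputStream response = null;
    try {
        String url = "http://example.com/"; // specify the URL
        response = new URL(url).openStream();

        // "\\A" matches the start of input, so next() returns the whole stream
        Scanner scanner = new Scanner(response);
        String responseBody = scanner.useDelimiter("\\A").next();
        // Print the text between <title> and </title> (7 is the length of "<title>")
        System.out.println(responseBody.substring(responseBody.indexOf("<title>") + 7, responseBody.indexOf("</title>")));

    } catch (IOException ex) {
        ex.printStackTrace();
    } finally {
        if (response != null) { // openStream() may have failed before assignment
            try {
                response.close();
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }
    }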
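
The indexOf approach above throws if the page has no title, and it misses tags written as <TITLE> or carrying attributes. A slightly more forgiving sketch follows. It swaps indexOf for a regular expression, and the class name TitleFromString and the method extractTitle are hypothetical. It is still plain string matching, not HTML parsing.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleFromString {

    // Matches <title ...>text</title>, ignoring case and allowing line breaks in the text.
    private static final Pattern TITLE = Pattern.compile(
            "<title\\b[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed title text, or null when no title element is found.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><TITLE lang=\"en\">\n Example Domain \n</TITLE></head></html>";
        System.out.println(extractTitle(html));
    }
}
```

The responseBody string read by the Scanner in the snippet above could be passed straight to extractTitle.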

How to get the title text from any web page in Java

4 answers: