Reading HTML: how can I use BufferedReader to skip the HEAD tag of a web page and read the HTML line by line?

Time: 2013-12-12 04:50:26

Tags: java html bufferedreader

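I have a simple problem that I am struggling to figure out. I want to read an HTML file line by line, but I want to skip the HEAD tag. So I figured I could start reading the text once I am past the HEAD tag.

So far I have: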

BufferedReader reader = new BufferedReader(new InputStreamReader(socket.getInputStream()));

StringBuilder string = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
    if (line.startsWith("<html>")) 
        string.append(line + "\n");
}

I want to keep the HTML code in memory without the HEAD information.

Example:

<HTML>

<HEAD>

    <TITLE>Your Title Here</TITLE>

</HEAD>

<BODY BGCOLOR="FFFFFF">

    <CENTER><IMG SRC="clouds.jpg" ALIGN="BOTTOM"> </CENTER>

    <a href="http://somegreatsite.com">Link Name</a>is a link to another nifty site

    <H1>This is a Header</H1>

    <H2>This is a Medium Header</H2>

    Send me mail at <a href="mailto:support@yourcompany.com">support@yourcompany.com</a>.

</BODY>

I want to save everything except the HEAD tag information.

1 Answer:

Answer 0: (score: 1)

How about something like this:

boolean htmlFound = false;                        // Have we found an open html tag?
StringBuilder string = new StringBuilder();       // Back to your code...
String line;
while ((line = reader.readLine()) != null) {
  if (!htmlFound) {                               // Have we found it yet?
    if (line.toLowerCase().startsWith("<html")) { // Check if this line opens a html tag...
      htmlFound = true;                           // yes? Excellent!
    } else {
      continue;                                   // Skip over this line...
    }
  }
  System.out.println("This is each line: " + line);
  string.append(line + "\n");
}
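
Note that the loop above starts copying at the `<html>` line, so the HEAD section still ends up in the buffer. A minimal sketch that also drops everything from `<head>` through `</head>` might look like the following. The class name `HeadSkipper` and its methods are hypothetical, and it assumes that neither the opening nor the closing HEAD tag is split across two lines. For anything beyond simple pages, a real HTML parser is probably the safer choice.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of the original answer.
class HeadSkipper {

    // Case-insensitive indexOf that keeps indices valid for the original string.
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = from; i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Finds "<head" followed by '>', whitespace, or end of line,
    // so that "<header>" is not mistaken for the HEAD tag.
    static int findHeadOpen(String line) {
        int i = indexOfIgnoreCase(line, "<head", 0);
        while (i >= 0) {
            int next = i + "<head".length();
            if (next == line.length() || line.charAt(next) == '>'
                    || Character.isWhitespace(line.charAt(next))) {
                return i;
            }
            i = indexOfIgnoreCase(line, "<head", i + 1);
        }
        return -1;
    }

    // Reads line by line and returns the HTML without the <head>...</head> section.
    static String stripHead(BufferedReader reader) throws IOException {
        StringBuilder html = new StringBuilder();
        boolean inHead = false;
        String line;
        while ((line = reader.readLine()) != null) {
            String before = "";
            int from = 0;
            if (!inHead) {
                int open = findHeadOpen(line);
                if (open < 0) {               // ordinary line outside HEAD: keep it
                    html.append(line).append('\n');
                    continue;
                }
                before = line.substring(0, open);  // keep text before <head>
                from = open;
                inHead = true;
            }
            String after = "";
            int close = indexOfIgnoreCase(line, "</head>", from);
            if (close >= 0) {                 // HEAD ends on this line
                after = line.substring(close + "</head>".length());
                inHead = false;
            }
            String kept = before + after;
            if (!kept.trim().isEmpty()) {     // drop lines that held only HEAD content
                html.append(kept).append('\n');
            }
        }
        return html.toString();
    }

    // Convenience overload for trying the method on an in-memory string.
    static String stripHead(String html) {
        try {
            return stripHead(new BufferedReader(new StringReader(html)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the `reader` from the question, this would be called as `String html = HeadSkipper.stripHead(reader);`. If the reader is wrapped around a raw socket, the HTTP response headers arrive before the HTML and would be copied through as ordinary lines, so they need to be skipped separately.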