Question

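I'm building an application that should take the HTML text of an entire web page and put it into a string. Then I want to use System.out.println to display one specific fragment of that string. My code: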

import java.net.*;
import java.io.*;

public class URLConnectionReader {
    public static void main(String[] args) throws Exception {

        URL oracle = new URL("www.example-blahblahblah.com");
        BufferedReader in = new BufferedReader(
        new InputStreamReader(oracle.openStream()));

        String inputLine;
        while ((inputLine = in.readLine()) != null)

       System.out.println(inputLine.substring(inputLine.indexOf("<section class=\"horoscope-content\"><p>")+1, inputLine.lastIndexOf("</p")));

        in.close();
    }
}

It should display the text enclosed below:

<section class="horoscope-content">
    <p>Text text text text</p>

Instead I get this:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(Unknown Source)
    at URLConnectionReader.main(URLConnectionReader.java:14)

What should I do?

Answer 1

To cope with small variations in the input, use a more forgiving regular expression instead of indexOf; it is more robust:

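import java.util.regex.Matcher;
import java.util.regex.Pattern;

// `line` holds the text to search
Pattern pattern = Pattern.compile("<section\\s+class\\s*=\\s*\"horoscope-content\"\\s*>\\s*<p>(.*?)</p>", Pattern.DOTALL);
Matcher matcher = pattern.matcher(line);
if (matcher.find()) {
    System.out.println(matcher.group()); // the whole match, tags included
    System.out.println("Text in paragraph: " + matcher.group(1)); // only the text between <p> and </p>
}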

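This tolerates line breaks and other whitespace between the tags, provided the string being matched holds the whole page rather than a single line returned by readLine().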
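As a minimal, self-contained sketch of the regex approach: the class name HoroscopeRegexDemo, the helper extractParagraph, and the sample HTML are assumptions made up for illustration, not part of the original code. It applies the same pattern to a string that contains a line break between the two tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HoroscopeRegexDemo {
    // Same pattern as in the answer above; DOTALL lets '.' match line breaks too.
    private static final Pattern SECTION = Pattern.compile(
            "<section\\s+class\\s*=\\s*\"horoscope-content\"\\s*>\\s*<p>(.*?)</p>",
            Pattern.DOTALL);

    // Hypothetical helper: returns the paragraph text, or null when the section is absent.
    public static String extractParagraph(String html) {
        Matcher matcher = SECTION.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up sample input, shaped like the snippet in the question.
        String html = "<html><body>\n"
                + "<section class=\"horoscope-content\">\n"
                + "    <p>Text text text text</p>\n"
                + "</section></body></html>";
        System.out.println(extractParagraph(html));                     // Text text text text
        System.out.println(extractParagraph("<p>no section here</p>")); // null
    }
}
```

Returning null instead of calling substring with a -1 index is what avoids the StringIndexOutOfBoundsException when the section is missing.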

Answer 2

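Your code reassigns inputLine every time the while condition is checked, so each substring call only ever sees a single line. Depending on the HTML, you may want to read the entire file before looking for the tagged section.
Unless you are certain the HTML contains those pieces of text, you will still get an exception whenever they are absent. You are also only advancing the start index by 1; if you don't want the opening tag text in the output, you have to advance it by the length of the opening text instead.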

You could try something like this:

StringBuilder html = new StringBuilder(); //holds all of the html we read
String inputLine;
while ((inputLine = in.readLine()) != null) { //read line by line
  html.append(inputLine); //add line to html (readLine() strips the line break)
}
inputLine = html.toString(); //get the whole page as a single string
String startText = "<section class=\"horoscope-content\"><p>"; //starting tag; must match the page exactly, including any whitespace between the tags
int start = inputLine.indexOf(startText);
int end = inputLine.lastIndexOf("</p"); //might want to use something like inputLine.indexOf("</p>", start); if there are multiple sections on the page
if(start >= 0 && end >= start + startText.length()) { //make sure we found a section and that the end tag comes after it
  System.out.println(inputLine.substring(start+startText.length(), end)); //print everything between the start and end tags (excluding the text in the start tag)
} else {
  System.out.println("section not found"); //do something else since we didn't find the tags
}
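The same idea can be sketched as a complete program that runs without a network connection. The class HoroscopeIndexOfDemo, the helpers readAll and between, and the sample page are hypothetical names and data introduced only for this illustration. It searches in two steps (first the section, then the paragraph inside it), which is one way, not the only one, to cope with whitespace between the two tags.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class HoroscopeIndexOfDemo {
    // Reads everything from the reader into one string, keeping line breaks.
    public static String readAll(Reader source) throws IOException {
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    // Returns the text between startText and the first endText after it, or null if either is missing.
    public static String between(String text, String startText, String endText) {
        int start = text.indexOf(startText);
        if (start < 0) return null;
        int from = start + startText.length(); // skip the whole start marker, not just one character
        int end = text.indexOf(endText, from);
        if (end < 0) return null;
        return text.substring(from, end);
    }

    public static void main(String[] args) throws IOException {
        // Made-up page content standing in for the real URL stream.
        String page = "<section class=\"horoscope-content\">\n"
                + "    <p>Text text text text</p>\n"
                + "</section>\n";
        String html = readAll(new StringReader(page));
        String section = between(html, "<section class=\"horoscope-content\">", "</section>");
        String paragraph = section == null ? null : between(section, "<p>", "</p>");
        System.out.println(paragraph != null ? paragraph : "section not found"); // Text text text text
    }
}
```

To use it against a real page, readAll could be given new InputStreamReader(oracle.openStream()) in place of the StringReader.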
