Extracting text from HTML code in Java

Date: 2018-02-11 22:36:04

Tags: java html text

I need to extract the text from the HTML of a website. This is the code I use to fetch the HTML:

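import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

public static void readFromWeb(String webURL) throws IOException {
    try {
        // Build the URL inside the try so the MalformedURLException handler can actually fire.
        URL url = new URL(webURL);
        // try-with-resources closes the reader and the underlying stream.
        try (BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line); // prints the raw HTML, tags included
            }
        }
    } catch (MalformedURLException e) {
        e.printStackTrace();
        throw new MalformedURLException("URL is malformed: " + webURL);
    } catch (IOException e) {
        e.printStackTrace();
        // Rethrow with the original exception as the cause instead of an empty IOException.
        throw new IOException("Failed to read " + webURL, e);
    }
}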

1 answer:

Answer 0: (score: 0)

Use JSoup:

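import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {

    public static void main(final String[] args) throws IOException {
        System.out.println(readFromWeb("http://www.stackoverflow.com/"));
    }

    // Fetches the page and returns its text content with the markup removed.
    public static String readFromWeb(final String webUrl) throws IOException {
        final Document doc = Jsoup.connect(webUrl).get();
        return doc.text();
    }
}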
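
If adding the JSoup dependency is not an option, the same idea can be roughly sketched with the HTML parser that ships with the JDK (the Swing `HTMLEditorKit` classes). This is not part of the original answer; the class name `HtmlTextSketch` and the method `extractText` are made up for illustration, and the sketch works on an HTML string that has already been downloaded. The Swing parser targets old HTML (3.2-era) and is likely to cope worse with modern pages than JSoup, so treat it as a fallback rather than a replacement.

```java
import java.io.IOException;
import java.io.StringReader;

import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not from the original answer.
class HtmlTextSketch {

    // Collects every text node the parser reports, separated by single spaces.
    static String extractText(final String html) throws IOException {
        final StringBuilder text = new StringBuilder();
        final HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(final char[] data, final int pos) {
                text.append(data).append(' ');
            }
        };
        // The third argument tells the parser to ignore charset declarations in the markup.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return text.toString().trim();
    }

    public static void main(final String[] args) throws IOException {
        final String html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(html));
    }
}
```

To combine this with the code from the question, append each line read from the `BufferedReader` to a `StringBuilder` instead of printing it, then pass the resulting string to `extractText`.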