I need to extract the text from a web page's HTML. This is the code I currently use to fetch the HTML source:
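import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public static void readFromWeb(String webURL) throws IOException {
    // new URL(...) throws MalformedURLException (a subclass of IOException) for a bad URL
    URL url = new URL(webURL);
    // try-with-resources closes the reader and the underlying stream, even on failure.
    // UTF-8 is an assumption here; the page may declare a different charset.
    try (BufferedReader br = new BufferedReader(
            new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
        String line;
        while ((line = br.readLine()) != null) {
            System.out.println(line);
        }
    }
    // No catch blocks: rethrowing a fresh exception would discard the original message and cause.
}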
Answer 0 (score: 0)
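Use jsoup. Jsoup.connect(...).get() fetches and parses the page, and Document.text() returns its text with the tags stripped: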
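import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {
    public static void main(final String[] args) throws IOException {
        System.out.println(readFromWeb("http://www.stackoverflow.com/"));
    }

    // Downloads the page, parses it into a DOM, and returns only its text content.
    public static String readFromWeb(final String webUrl) throws IOException {
        final Document doc = Jsoup.connect(webUrl).get();
        return doc.text();
    }
}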
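If adding the jsoup dependency is not an option, a rougher version of the same idea can probably be done with the HTML parser that ships with the JDK (javax.swing.text.html). To be clear, this is a different technique from the jsoup call above, not what jsoup does internally, and the Swing parser only targets old HTML 3.2, so treat it as a minimal sketch. The class name HtmlTextSketch and the method extractText are made up for illustration; the input is an HTML string you have already downloaded.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text nodes of an HTML string using the JDK's Swing parser.
class HtmlTextSketch {

    static String extractText(final String html) {
        final StringBuilder text = new StringBuilder();

        // The parser calls handleText once for each run of text it finds between tags.
        final HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(final char[] data, final int pos) {
                text.append(data).append(' ');
            }
        };

        try {
            // Third argument true: ignore any charset declared inside the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected when reading from a String, but the API declares it.
            throw new UncheckedIOException(e);
        }
        return text.toString().trim();
    }

    public static void main(final String[] args) {
        System.out.println(extractText("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

For real pages jsoup is still the safer choice, since it copes with modern and malformed markup and normalizes whitespace for you.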