Question

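I am developing a Java web application and would like to know how to get the value of a particular field (a table and/or output text) from some website. Assuming this component always has the same ID, does anyone know how to retrieve this information? I don't know whether anyone has run into this problem before, but if you have any ideas, please share. Thanks.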

Answer 1

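In general:
1.) Retrieve the page markup by opening an HTTP connection to the URL and reading the response into your application.
2.) Parse the markup with a library such as jsoup and retrieve the values you need.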
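
Step 1 needs nothing beyond the JDK. The following is only a minimal sketch, assuming Java 9 or later and a page encoded as UTF-8; the class name PageFetcher and the 10-second timeouts are illustrative choices, not part of any library:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name is illustrative only.
class PageFetcher {

    /** Downloads the resource at the given URL and returns its body as a string. */
    static String fetch(String url) throws IOException {
        URLConnection conn = URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(10_000); // give up after 10 s instead of hanging
        conn.setReadTimeout(10_000);
        try (InputStream in = conn.getInputStream()) {
            // Assumption: the page is UTF-8 encoded.
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

Calling PageFetcher.fetch("http://www.example.com") would return the markup as one string, ready to hand to a parser. A real application would probably read the charset from the Content-Type header rather than assume UTF-8.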

More specifically, here is some sample code that fetches the page with Apache HttpClient and parses it with jsoup:

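import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Step 1: fetch the page markup (Apache HttpClient 4.x API).
HttpClient http = new DefaultHttpClient();
HttpGet request = new HttpGet("http://www.example.com");
StringBuilder htmlcode = new StringBuilder();
try {
    HttpResponse response = http.execute(request);
    // Assumes the page is UTF-8; use the charset from the response headers if it differs.
    try (BufferedReader read = new BufferedReader(new InputStreamReader(
            response.getEntity().getContent(), StandardCharsets.UTF_8))) {
        String line;
        while ((line = read.readLine()) != null) {
            htmlcode.append(line).append('\n'); // readLine() strips the line break, so restore it
        }
    }
} catch (IOException e) { // also covers ClientProtocolException, a subclass of IOException
    e.printStackTrace();
}

// Step 2: at this point we have the page's markup, so parse it with jsoup.
Document doc = Jsoup.parse(htmlcode.toString());
Elements lis = doc.getElementsByTag("li"); // get all entries in lists
for (Element el : lis) {
    String val = el.text().trim();
    // do something for each list entry
}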
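
Since the question is about a component with a fixed ID, the same approach can look the element up by ID instead of by tag. The following is a sketch only, assuming jsoup is on the classpath; the FieldExtractor class, its method names, and the sample IDs "total" and "prices" are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class; names and sample markup are illustrative only.
class FieldExtractor {

    /** Returns the text of the element with the given id, or null if there is none. */
    static String textById(Document doc, String id) {
        Element el = doc.getElementById(id);
        return el == null ? null : el.text();
    }

    /** Returns the cell texts of the table with the given id, one list per row. */
    static List<List<String>> tableById(Document doc, String id) {
        List<List<String>> rows = new ArrayList<>();
        Element table = doc.getElementById(id);
        if (table == null) {
            return rows; // no such table: return an empty result
        }
        for (Element tr : table.select("tr")) {
            List<String> cells = new ArrayList<>();
            for (Element cell : tr.select("th, td")) {
                cells.add(cell.text());
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample markup standing in for a downloaded page.
        String html = "<span id=\"total\">42</span>"
                + "<table id=\"prices\"><tr><td>apple</td><td>1.20</td></tr>"
                + "<tr><td>pear</td><td>0.95</td></tr></table>";
        Document doc = Jsoup.parse(html);
        System.out.println(textById(doc, "total"));
        System.out.println(tableById(doc, "prices"));
    }
}
```

If no separate HTTP client is needed, jsoup can also download the page itself with Jsoup.connect(url).get(), which returns a Document directly.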

Answer 2

What you are describing is web scraping. Take a look at this library for PHP:

http://simplehtmldom.sourceforge.net/

How do I retrieve specific information from a website?
