Question

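I am trying to fetch the response of a web page using Java and write it to an HTML file so that I can refer to it locally later.

Many websites change their content from day to day. For example, https://en.wikipedia.org/wiki/Main_Page shows different content every day.

I fetched the Wikipedia main page and saved it as HTML. Yesterday the saved file matched the live main page.

Today the Wikipedia page has changed, but my saved HTML file still shows yesterday's content.

How can I check whether the response has changed? What should I store in the database the first time I fetch the response, and what should I check when I request the same URL again later?

Here is my code: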

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

URL url = new URL("https://en.wikipedia.org/wiki/Main_Page");
// Open the connection once; the unused Base64 credentials were dropped
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("GET");
connection.connect();

// try-with-resources closes the reader and the writer even if an exception is thrown
try (BufferedReader in = new BufferedReader(
         new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8));
     BufferedWriter writer = new BufferedWriter(new FileWriter("f:\\new.html"))) {
    String line;
    while ((line = in.readLine()) != null) {
        writer.write(line);
        writer.newLine(); // readLine() strips the line terminator, so put it back
    }
} catch (IOException e) {
    e.printStackTrace();
}

Answer 1

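You can try using the value of the "Last-Modified" header in the server's response. Parsing it into a proper date object allows a simple date comparison, which tells you whether you should scrape the page again.

Compare the date on which you last fetched the site with the date taken from the response header: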

String siteModifiedDate = connection.getHeaderField("Last-Modified"); // null if the server did not send the header

If the two dates differ, fetch the page again and store the new content.
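The comparison described above might look like the following in Java. This is only a sketch: the class name LastModifiedCheck and both method names are made up for illustration, and it assumes the timestamp of the last download is stored as epoch milliseconds. It relies on URLConnection.getLastModified(), which returns 0 when the server sends no Last-Modified header.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper; the class and method names are not from the question.
final class LastModifiedCheck {

    /** True when the page should be downloaded again. */
    static boolean isNewer(long storedMillis, long serverMillis) {
        // getLastModified() yields 0 when the header is missing; in that case
        // the date is unknown, so download again to be safe.
        return serverMillis == 0 || serverMillis > storedMillis;
    }

    /** Sends a HEAD request and returns Last-Modified as epoch millis (0 if absent). */
    static long fetchLastModified(String pageUrl) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(pageUrl).openConnection();
        connection.setRequestMethod("HEAD"); // headers only, no body is transferred
        try {
            return connection.getLastModified();
        } finally {
            connection.disconnect();
        }
    }
}
```

A caller would load the stored timestamp, call fetchLastModified with the page URL, and download the page again only when isNewer returns true.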

Answer 2

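Another approach is to use a hash function (see https://docs.oracle.com/javase/7/docs/api/java/security/MessageDigest.html) to compute a hash of the page body. This always works, even when the Last-Modified header is absent. You can use the jsoup library (http://jsoup.org/) to retrieve the body.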

' Poll once per second, for up to 30 seconds, until the element exists
Dim flag As Boolean: flag = False
Dim lng_cnt As Long: lng_cnt = 0
Dim elem_temp As IHTMLElement
Do Until flag = True Or lng_cnt = 30
    ' Look the element up on every pass; a single lookup before the loop never changes
    Set elem_temp = IE.document.getElementById("*element_name*")
    If elem_temp Is Nothing Then
        flag = False
    Else
        flag = True
        Exit Do
    End If
    lng_cnt = lng_cnt + 1
    Application.Wait (Now() + TimeValue("00:00:01"))
Loop
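The hashing approach this answer describes could be sketched in Java roughly as follows, using only java.security.MessageDigest. The class name BodyHash and its methods are hypothetical, SHA-256 is an assumed choice of algorithm, and fetching the body (for example with jsoup) is left to the caller.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; the names and the choice of SHA-256 are assumptions.
final class BodyHash {

    /** Returns the SHA-256 digest of the page body as a lowercase hex string. */
    static String sha256Hex(String body) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(body.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b)); // two hex digits per byte
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to support SHA-256
            throw new IllegalStateException(e);
        }
    }

    /** True when the freshly fetched body no longer matches the stored hash. */
    static boolean hasChanged(String storedHash, String freshBody) {
        return !sha256Hex(freshBody).equals(storedHash);
    }
}
```

Store the result of sha256Hex in the database on the first download, then pass it to hasChanged on later requests. One caveat: if the page embeds per-request values such as timestamps, the hash of the full body may differ on every fetch, so hashing only the extracted text may be more stable.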

Answer 3

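I am using both of the answers provided by Cyril and Shree29.

First, I check the "Last-Modified" header and store it for reference. If it is null, I compute the hash of the body and store that instead.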
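One way the combined strategy could look in Java is sketched below. The class name PageFingerprint, the prefix strings, and the use of SHA-256 are all assumptions made for illustration; the answer itself does not show code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; the names and the "last-modified:" / "sha256:" prefixes are assumptions.
final class PageFingerprint {

    /** Prefers the Last-Modified header; falls back to a hash of the body when it is null. */
    static String fingerprint(String lastModifiedHeader, String body) {
        if (lastModifiedHeader != null) {
            return "last-modified:" + lastModifiedHeader;
        }
        return "sha256:" + sha256Hex(body);
    }

    /** True when the current fingerprint differs from the one stored earlier. */
    static boolean hasChanged(String storedFingerprint, String lastModifiedHeader, String body) {
        return !fingerprint(lastModifiedHeader, body).equals(storedFingerprint);
    }

    private static String sha256Hex(String body) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256")
                    .digest(body.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to support SHA-256
            throw new IllegalStateException(e);
        }
    }
}
```

The prefix keeps the two kinds of value distinguishable in a single database column, so a stored header value is never compared against a hash by accident.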

Thanks to Cyril and Shree29.

Different web content for the same page
