Question

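I want to scrape some data from the web using Java, but I found that the page loads more data when you scroll to the end of it. I am not a web developer, and I don't know which technique they use to load data when the scroll reaches the end of the page.

Can you give me some hints? Which technique are they using, and how can I read the data without a browser? (I wrote Java code that reads data from the site using URLConnection.)

The website looks like this: "https://www.healthtap.com/#topics/Women%27s%20health".

Thanks.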

Answer 1

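This is a common "problem" for web crawler bots. Some pages contain dynamic content that is added from included sources. This content can be loaded on page load or when triggered (as in your example, by scrolling down). When the target page is downloaded and scraped, the DOM structure does not, in most cases, include the HTML elements that hold the externally included data.

What I suggest is to identify the source path of that data, which can be done by closely examining the scripts in the DOM, and then request it as a secondary source that includes all the missing data you need.

Edit:

In the example you linked, it is simple: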

      - Install Firebug.
      - Scroll down the page to find the script that fires the request.
      - Now you can see the link and the variables that are used for dynamically adding the content.

www.healthtap.com/#topics/Women%27s%20health:

The dynamically requested link:

https://www.healthtap.com/topics/Women%27s%20health.json?extended_categories=1&auth_token=false&per_page=8&page=7&per_page=8&auth_token=false&generate_token=true

You can see some parameters that you can use:

 1/ topics/ + the first value name from the page URL (here, the topic) + .json?
 2/ per_page=num -> how many results to return
 3/ generate_token=true -> it's a security value, but just change it to false and it works fine

Now you can use this link to load all the data you need and merge it with the main page you scraped.

Test it!
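For the Java side, here is a minimal sketch of how the link above could be requested page by page without a browser. The class name `TopicJsonFetcher` and the helpers `buildPageUrl` and `fetch` are hypothetical names chosen for illustration, and the sketch assumes the endpoint still answers with the parameters shown above and accepts `generate_token=false`. The site may have changed since then, so treat it as a starting point rather than a confirmed client.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class TopicJsonFetcher {

    // Base path taken from the request observed in the answer above.
    private static final String BASE = "https://www.healthtap.com/topics/";

    /** Builds the JSON URL for one page of results of a topic (hypothetical helper). */
    static String buildPageUrl(String topic, int page, int perPage) {
        String encoded;
        try {
            // URLEncoder does form encoding (space becomes '+'); a URL path needs %20.
            encoded = URLEncoder.encode(topic, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
        return BASE + encoded + ".json?extended_categories=1&auth_token=false"
                + "&per_page=" + perPage + "&page=" + page
                + "&generate_token=false"; // assumption: false is accepted, as the answer reports
    }

    /** Downloads the response body of a URL as a string. */
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("Accept", "application/json");
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
            return body.toString();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Request the first three pages, eight results each, and report their size.
        for (int page = 1; page <= 3; page++) {
            String json = fetch(buildPageUrl("Women's health", page, 8));
            System.out.println("page " + page + ": " + json.length() + " characters");
        }
    }
}
```

Increasing `page` in a loop plays the role of scrolling down in the browser. The returned JSON can then be handed to whichever JSON parser you prefer and merged with the data scraped from the main page.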
