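Jsoup response does not match the browser inspector

Time: 2018-08-23 11:15:16

Tags: spring reactjs jsoup

I want to parse a web page with jsoup, but the returned HTML is not what the browser inspector shows. When I visit the page in a browser, I can see an <ol> tag under the <section id='js_item_list_section'>......</section> tag. But when I fetch the page with jsoup in a Spring Boot project, there is no <ol> tag under that section. There is a <div key=""> under the section instead. The returned response is below: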

Jsoup response:

<section id="js_item_list_section" class="item-list item-list--loading clearfix">
 <div key="itemlist-loader" class="ellipsis-loader-wrapper ellipsis-loader-wrapper--text ellipsis-loader-wrapper--top">
  <div class="ellipsis-loader ellipsis-loader--branded center-x">
   <div class="ellipsis-loader__dot"></div>
   <div class="ellipsis-loader__dot"></div>
   <div class="ellipsis-loader__dot"></div>
  </div>
  <span class="loader-text center-x">Y&uuml;kleniyor</span>
 </div>
</section>

Web browser (Chrome) inspector:

<section id="js_item_list_section" class="item-list clearfix">
  <ol>
     <li>.....</li>
     <li>.....</li>
  </ol>
</section>

I think this is related to React.js.

Here is my code block as well:

Document document = Jsoup.connect(myUrl)
  .ignoreContentType(true)
  .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36")
  .get();
Element itemListSection  = document.getElementById("js_item_list_section");

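2 answers:

Answer 0 (score: 1)

The problem is most likely that the page you are parsing contains dynamically generated content (the id js_item_list_section already hints that this content is rendered with JavaScript).

Jsoup does not interpret JavaScript, so it also does not load content that the page fetches through AJAX calls. Unfortunately, that means Jsoup alone cannot get at this content.

I see two options:

1) Use a tool such as Selenium WebDriver, which controls a real browser from Java and can therefore also parse dynamically generated content. This is easy to implement, but it introduces a new dependency (a whole browser!) and is slow to run.

2) Analyze the AJAX calls that the JavaScript makes to load the content it renders on the page. Use the browser's developer tools to find the actual calls, then make the same call directly from Java and parse that data. Such data is usually transferred as JSON, so Jsoup is of only limited help here. This option takes more effort, but it runs faster and adds no further dependencies to the project.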
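
As a rough sketch of option 2, assuming the Network tab of the developer tools shows a plain GET request that returns JSON: the class name AjaxEndpointSketch, the method fetchBody, the example URL and the two request headers are all placeholders of mine, not something taken from the page in the question. It uses only JDK classes, in keeping with the point about not adding dependencies.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class AjaxEndpointSketch {

    // Fetches the raw response body of the given URL and decodes it as UTF-8.
    static String fetchBody(String endpointUrl) throws IOException {
        URLConnection connection = new URL(endpointUrl).openConnection();
        // Assumed headers: copy whatever the real request in the Network tab sends.
        connection.setRequestProperty("Accept", "application/json");
        connection.setRequestProperty("X-Requested-With", "XMLHttpRequest");
        connection.setConnectTimeout(10_000);
        connection.setReadTimeout(10_000);

        try (InputStream in = connection.getInputStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical placeholder: pass the request URL found in the developer tools.
        String endpointUrl = args.length > 0 ? args[0] : "https://example.com/api/items";
        System.out.println(fetchBody(endpointUrl));
    }
}
```

The returned string can then be handed to whatever JSON parser the project already has. If the real call is a POST or needs cookies, the request setup has to be extended to match.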

Answer 1 (score: 0)

I tried WebDriver like this:

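// System.setProperty needs a key and a value. ChromeDriver reads this key,
// and the value must be the path to the chromedriver executable, not chrome.exe.
System.setProperty("webdriver.chrome.driver", MyChromeExePath);
WebDriver webDriver = new ChromeDriver();
webDriver.get(trivagoUrl.toString());
String pageSource = webDriver.getPageSource();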

After the line WebDriver webDriver = new ChromeDriver(); the browser opens. After that, it throws a timeout exception:

2018-08-24 18:52:01.116 ERROR 29316 --- [nio-8080-exec-6] o.a.c.c.C.[.[.[/].[dispatcherServlet]    : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is org.openqa.selenium.WebDriverException: Timed out waiting for driver server to start.
Build info: version: '3.9.1', revision: '63f7b50', time: '2018-02-07T22:25:02.294Z'
System info: host: 'DESKTOP-RP0T36G', ip: '192.168.1.21', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_121'
Driver info: driver.version: ChromeDriver] with root cause

java.util.concurrent.TimeoutException: null
    at java.util.concurrent.FutureTask.get(Unknown Source) ~[na:1.8.0_121]
    at com.google.common.util.concurrent.SimpleTimeLimiter.callWithTimeout(SimpleTimeLimiter.java:148) ~[guava-23.6-jre.jar:na]
    at org.openqa.selenium.net.UrlChecker.waitUntilAvailable(UrlChecker.java:75) ~[selenium-remote-driver-3.9.1.jar:na]
    at org.openqa.selenium.remote.service.DriverService.waitUntilAvailable(DriverService.java:187) ~[selenium-remote-driver-3.9.1.jar:na]
    at org.openqa.selenium.remote.service.DriverService.start(DriverService.java:178) ~[selenium-remote-driver-3.9.1.jar:na]
    at org.openqa.selenium.remote.service.DriverCommandExecutor.execute(DriverCommandExecutor.java:79) ~[selenium-remote-driver-3.9.1.jar:na]
    at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:601) ~[selenium-remote-driver-3.9.1.jar:na]
    at org.openqa.selenium.remote.RemoteWebDriver.startSession(RemoteWebDriver.java:219) ~[selenium-remote-driver-3.9.1.jar:na]
    at org.openqa.selenium.remote.RemoteWebDriver.<init>(RemoteWebDriver.java:142) ~[selenium-remote-driver-3.9.1.jar:na]
    at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:181) ~[selenium-chrome-driver-3.9.1.jar:na]
    at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:168) ~[selenium-chrome-driver-3.9.1.jar:na]
    at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:123) ~[selenium-chrome-driver-3.9.1.jar:na]
    at com.io.zizu.m2m.parse.command.TrivagoSearchCommand.getSearchResults(TrivagoSearchCommand.java:131) ~[main/:na]
    at com.io.zizu.m2m.parse.command.TrivagoSearchCommand$$FastClassBySpringCGLIB$$a6dcf772.invoke(<generated>) ~[main/:na]
    at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204) ~[spring-core-5.0.8.RELEASE.jar:5.0.8.RELEASE]
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:746) ~[spring-aop-5.0.8.RELEASE.jar:5.0.8.RELEASE]
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) ~[spring-aop-5.0.8.RELEASE.jar:5.0.8.RELEASE]
    at org.springframework.cache.interceptor.CacheInterceptor.lambda$invoke$0(CacheInterceptor.java:53) ~[spring-context-5.0.8.RELEASE.jar:5.0.8.RELEASE]
    at org.springframework.cache.interceptor.CacheAspectSupport.invokeOperation(CacheAspectSupport.java:336) ~[spring-context-5.0.8.RELEASE.jar:5.0.8.RELEASE]
    at org.springframework.cache.interceptor.CacheAspectSupport.execute(CacheAspectSupport.java:391) ~[spring-context-5.0.8.RELEASE.jar:5.0.8.RELEASE]
    at org.springframework.cache.interceptor.CacheAspectSupport.execute(CacheAspectSupport.java:316) ~[spring-context-5.0.8.RELEASE.jar:5.0.8.RELEASE]
    at org.springframework.cache.interceptor.CacheInterceptor.invoke(CacheInterceptor.java:61) ~[spring-context-5.0.8.RELEASE.jar:5.0.8.RELEASE]
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:185) ~[spring-aop-5.0.8.RELEASE.jar:5.0.8.RELEASE]