Question

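Goal: get the page source of a dynamically loaded page. Approach: Java + Selenium + geckodriver. Problem: everything works on Windows, but after deploying to CentOS the result is not what I expect.

I am trying to deploy my crawler code from my local machine to a CentOS 7 server. (The code works fine on my machine.) After unpacking it on the CentOS server I of course reconfigured the relevant settings (firefox.bin + geckodriver.sh). The page source I get back looks like the page code from before rendering.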

// The page URL I want
String url = "http://rd.huangpuqu.sh.cn/website/html/shprd/shprd_ztrd_cwh/List/list_1.htm";


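// Crawler code
import java.util.concurrent.atomic.AtomicLong;
import org.apache.log4j.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxOptions;

// Global is the project's own configuration helper (not shown here).
public class MyCrawlerUtils {

    private static final AtomicLong counter = new AtomicLong();

    private static final Logger logger = Logger.getLogger(MyCrawlerUtils.class);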

    public static Document getOriginalPage(String url) {
        // Path to the local driver binary the browser uses
        String firefoxDriver = Global.getConfig("firefox.driver");
        System.setProperty("webdriver.gecko.driver", firefoxDriver); // geckodriver 0.24.0; the path is read from the config file

        // Path to the Firefox binary; not needed if Firefox is installed in the default location
        String firefoxExe = Global.getConfig("firefox.execute");
        System.setProperty("webdriver.firefox.bin", firefoxExe);


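        FirefoxOptions options = new FirefoxOptions();
        options.addArguments("disable-infobars"); // a Chrome switch; probably not needed for Firefox
        options.addArguments("--headless");
        options.setHeadless(true); // redundant with the --headless argument above

        // Create the driver object
        FirefoxDriver driver = new FirefoxDriver(options);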

        // Send the request to the given URL
        driver.get(url);

        // Wait a while for the page to render
        try {
            Thread.sleep(3000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        String pageSource = driver.getPageSource();
        logger.info("{{{" + pageSource + "}}}"); // !!! this is where I get something unexpected
        Document document = Jsoup.parse(pageSource);


        logger.info("Record #" + counter.incrementAndGet() + ", page URL: " + url);

        // Shut down the driver
        driver.quit();
        return document;
    }
}

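I expect the rendered pageSource, so that I can parse the information I need. Something like this: "欢迎您，下午好！2019年7月5日14:46:47 ..." (Welcome, good afternoon! July 5, 2019, 14:46:47 ...)

But all I get is this (it looks like the original page code):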

 <html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
        <!--[if lt IE 9]><script r='m'>document.createElement("section")</script><![endif]-->
    </head>
<body>
<input 
    type="hidden" 
    id="__onload__" 
    name="qLsp0ZDBKQUw_70MRYeJh0bkMr.oUykkn2yj1KXRhPucI8hFjVeSpylsPEgk8gowdN0vGovDjIqFiTyyzVRJo44Js_zY9Bhwx9lUgTQJk8RZnIFQfdLRR4p7VLDx00SPA41uZw4PYM2VDSXuiOeF6KZLDZT2Jmkfn.E_KlSSYwq" 
    value="U17W7zqe6L3khRlEHvj1WG">


</body>
</html>

Answer 1

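This sounds silly, but I made the code wait longer and then I got the correct pageSource. Just change `Thread.sleep(3000)` to `Thread.sleep(5000)` and everything works.

Note that when you deploy a project to a server environment, you have to take network latency into account. This is a common problem for everyone. Thanks a lot~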
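A fixed sleep only works as long as the network stays faster than the number you picked. As a rough sketch of a more tolerant approach, the helper below polls the page source until it no longer looks like the placeholder page, or until a timeout expires. The class name `PageSourcePoller`, both method names, and the idea of treating the hidden `__onload__` input as the "not rendered yet" marker are assumptions made for illustration, based only on the output shown in the question. Selenium also ships its own `WebDriverWait` for this purpose; the version here uses only the JDK so it can run without a browser.

```java
import java.time.Duration;
import java.util.function.Predicate;
import java.util.function.Supplier;

public class PageSourcePoller {

    /**
     * Fetches the page source repeatedly until isReady accepts it or the
     * timeout elapses. Returns the last source fetched in either case.
     */
    public static String waitForSource(Supplier<String> source,
                                       Predicate<String> isReady,
                                       Duration timeout,
                                       Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        String page = source.get();
        while (!isReady.test(page) && System.nanoTime() - deadline < 0) {
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag and stop waiting
                break;
            }
            page = source.get();
        }
        return page;
    }

    /**
     * Assumption: the unrendered page can be recognized by the hidden
     * input with id "__onload__" seen in the question's output.
     */
    public static boolean isRendered(String page) {
        return page != null && !page.contains("id=\"__onload__\"");
    }

    public static void main(String[] args) {
        // Stand-in for driver::getPageSource: placeholder twice, then the rendered page.
        int[] calls = {0};
        Supplier<String> fakeDriver = () -> ++calls[0] < 3
                ? "<input type=\"hidden\" id=\"__onload__\">"
                : "<html><body>rendered</body></html>";

        String page = waitForSource(fakeDriver, PageSourcePoller::isRendered,
                Duration.ofSeconds(2), Duration.ofMillis(50));
        System.out.println("rendered: " + isRendered(page));
    }
}
```

In the crawler from the question, the `Thread.sleep` block could then be replaced by something like `waitForSource(driver::getPageSource, PageSourcePoller::isRendered, Duration.ofSeconds(15), Duration.ofMillis(500))`, where the 15-second timeout and 500 ms interval are arbitrary starting values. The caller should still check `isRendered` on the result, since the helper returns the last source even when the timeout is hit.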

I am trying to use Selenium to get the page source, but it does not work on CentOS