Question

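I am working on a project in Java in which I have to use the DOM. To do that, I first use Selenium to load the dynamic page for any given URL, and then I parse it with Jsoup.

I want to get the dynamic page source for a given URL.

Code snapshot: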

public static void main(String[] args) throws IOException {

     // Selenium
     WebDriver driver = new FirefoxDriver();
     driver.get("ANY URL HERE");  
     String html_content = driver.getPageSource();
     driver.close();

     // Jsoup makes DOM here by parsing HTML content
     Document doc = Jsoup.parse(html_content);

     // OPERATIONS USING DOM TREE
}

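The problem is that Selenium accounts for about 95% of the total processing time, which is unacceptable.

Selenium first opens Firefox, then loads the given page, and then gets the dynamic page source.

Can you tell me how to reduce the time Selenium takes, perhaps by replacing it with a more efficient tool? Any other suggestions are also welcome.

Edit No. 1

Some code is provided at this link.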

FirefoxProfile profile = new FirefoxProfile();
profile.setPreference("general.useragent.override", "some UA string");
WebDriver driver = new FirefoxDriver(profile);

But I don't understand what the second line here does, and Selenium's documentation is also poor.

Edit No. 2

    System.out.println("Fetching " + url1 + " ...");
    System.out.println("Fetching " + url2 + " ...");

    WebDriver driver = new FirefoxDriver(createFirefoxProfile());

    driver.get(url1);
    String hml1 = driver.getPageSource();

    driver.get(url2);
    String hml2 = driver.getPageSource();
    driver.close();

    Document doc1 = Jsoup.parse(hml1);
    Document doc2 = Jsoup.parse(hml2);

Answer 1

Try this:

public static void main(String[] args) throws IOException {

    // Selenium
    WebDriver driver = new FirefoxDriver(createFirefoxProfile());
    driver.get("ANY URL HERE");
    String html_content = driver.getPageSource();
    driver.close();

    // Jsoup makes DOM here by parsing HTML content
    // OPERATIONS USING DOM TREE
}

private static FirefoxProfile createFirefoxProfile() {
    File profileDir = new File("/tmp/firefox-profile-dir");
    if (profileDir.exists())
        return new FirefoxProfile(profileDir);
    FirefoxProfile firefoxProfile = new FirefoxProfile();
    File dir = firefoxProfile.layoutOnDisk();
    try {
        profileDir.mkdirs();
        FileUtils.copyDirectory(dir, profileDir);
    } catch (IOException e) {
        e.printStackTrace();
    }
    return firefoxProfile;
}

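The createFirefoxProfile() method creates a profile directory if one does not exist yet; if the profile already exists, it is reused. As a result, Selenium does not need to build the profile-dir structure on every run.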

Answer 2

If you are confident in your code, you can use PhantomJS. It is a headless browser, so it gets you your results quickly. Firefox (FF) takes a while to run.
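Before swapping browsers, it may be worth measuring where the time actually goes, so that any change can be compared against the roughly 95% figure from the question. The following is a minimal sketch that uses only the JDK; `PhaseTimer` and the phase names "fetch" and "parse" are hypothetical names chosen for illustration, and the lambdas in `main` are stand-ins for `driver.getPageSource()` and `Jsoup.parse(...)`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical helper: accumulates wall-clock time per named phase. */
class PhaseTimer {
    private final Map<String, Long> nanosByPhase = new LinkedHashMap<>();

    /** Runs the work, adds its elapsed time to the total for this phase, and returns its result. */
    <T> T time(String phase, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            nanosByPhase.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    /** Total nanoseconds recorded for the phase, or 0 if it never ran. */
    long nanos(String phase) {
        return nanosByPhase.getOrDefault(phase, 0L);
    }

    /** Fraction (0..1) of all recorded time that was spent in the phase. */
    double share(String phase) {
        long total = 0;
        for (long n : nanosByPhase.values()) {
            total += n;
        }
        return total == 0 ? 0.0 : (double) nanos(phase) / total;
    }

    public static void main(String[] args) {
        PhaseTimer timer = new PhaseTimer();

        // Stand-in for: driver.get(url); driver.getPageSource();
        String html = timer.time("fetch", () -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "<html></html>";
        });

        // Stand-in for: Jsoup.parse(html)
        int length = timer.time("parse", () -> html.length());

        System.out.printf("fetch share: %.0f%%%n", 100 * timer.share("fetch"));
    }
}
```

Wrapping the real Selenium calls and the Jsoup call in `timer.time(...)` should make it easier to tell whether browser start-up, page load, or parsing dominates, and whether a headless browser changes that.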

Selenium takes a lot of time to get the dynamic page of a given URL