How do I use HtmlUnitDriver for web scraping?

Asked: 2014-04-02 09:45:49

Tags: java selenium-webdriver web-scraping htmlunit-driver

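Hi, I'm scraping a web page with Selenium WebDriver and I can get my data, but the problem is that this drives a real browser directly. I don't want to open a web browser; I just want to scrape all the data as it is.

How can I achieve this?

Here is my code: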

    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.firefox.FirefoxDriver;
    import org.openqa.selenium.support.ui.Select;

    public class GetData {

        public static void main(String args[]) throws InterruptedException {
            String sDate = "27/03/2014";
            WebDriver driver = new FirefoxDriver();
            String url="http://www.upmandiparishad.in/commodityWiseAll.aspx";
            driver.get(url);
            Thread.sleep(5000);
            // select the commodity
            new Select(driver.findElement(By.id("ctl00_ContentPlaceHolder1_ddl_commodity"))).selectByVisibleText("Jo");
            // enter the date
            driver.findElement(By.id("ctl00_ContentPlaceHolder1_txt_rate")).sendKeys(sDate);
            // wait, then click the show button
            Thread.sleep(3000);
            driver.findElement(By.id("ctl00_ContentPlaceHolder1_btn_show")).click();
            Thread.sleep(5000);

            // get only the table text
            WebElement findElement = driver.findElement(By.id("ctl00_ContentPlaceHolder1_GridView1"));
            String htmlTableText = findElement.getText();
            // do whatever you want now; these are the raw table values
            System.out.println(htmlTableText);


            driver.close();
            driver.quit();

        }
    }


My updated code:



    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.htmlunit.HtmlUnitDriver;
    import org.openqa.selenium.support.ui.Select;

    public class Getdata1 {

        public static void main(String args[]) throws InterruptedException {
            WebDriver driver = new HtmlUnitDriver(BrowserVersion.FIREFOX_3_6);
            driver.get("http://www.upmandiparishad.in/commodityWiseAll.aspx");
            System.out.println(driver.getPageSource());
            Thread.sleep(5000);
            // select the commodity
            new Select(driver.findElement(By.id("ctl00_ContentPlaceHolder1_ddl_commodity"))).selectByVisibleText("Jo");

            String sDate = "12/04/2014"; // the date you want
            driver.findElement(By.id("ctl00_ContentPlaceHolder1_txt_rate")).sendKeys(sDate);

            driver.findElement(By.id("ctl00_ContentPlaceHolder1_btn_show")).click();
            Thread.sleep(3000);

            // get only the table text
            WebElement findElement = driver.findElement(By.id("ctl00_ContentPlaceHolder1_GridView1"));
            String htmlTableText = findElement.getText();
            // do whatever you want now; these are the raw table values
            System.out.println(htmlTableText);

            driver.close();
            driver.quit();

        }
    }

Thanks in advance.

1 Answer:

Answer 0 (score: 1)

Use HtmlUnit, or Selenium's HtmlUnitDriver:

    WebDriver driver = new HtmlUnitDriver(BrowserVersion.FIREFOX_17);
    driver.get("http://www.upmandiparishad.in/commodityWiseAll.aspx");
    System.out.println(driver.getPageSource());
    Thread.sleep(5000);
    // select the commodity
    new Select(driver.findElement(By.id("ctl00_ContentPlaceHolder1_ddl_commodity"))).selectByVisibleText("Jo");

    String sDate = "12/04/2014"; // the date you want
    driver.findElement(By.id("ctl00_ContentPlaceHolder1_txt_rate")).sendKeys(sDate);

    driver.findElement(By.id("ctl00_ContentPlaceHolder1_btn_show")).click();
    Thread.sleep(3000);

    // get only the table text
    WebElement findElement = driver.findElement(By.id("ctl00_ContentPlaceHolder1_GridView1"));
    String htmlTableText = findElement.getText();
    // do whatever you want now; these are the raw table values
    System.out.println(htmlTableText);

    driver.close();
    driver.quit();

To get tabular output, you could try something like this:

    // Print the raw table text, starting a new output line before every
    // token that parses as an integer (any integer token is treated as
    // the start of a row).
    String[] arrCells = htmlTableText.split(" ");
    for (String cell : arrCells) {
        boolean isNumber;
        try {
            Integer.parseInt(cell);
            isNumber = true;
        } catch (NumberFormatException ex) {
            isNumber = false;
        }
        System.out.print((isNumber ? "\n" : "") + cell + "\t");
    }
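
The integer heuristic above also breaks the line at any other whole-number cell, such as a price. As an alternative, here is a minimal sketch that assumes `getText()` returns one line per table row, which is common but not verified for this page. The class `TableTextParser`, its methods, and the sample string are hypothetical names and placeholder data for illustration, not part of Selenium or of the site's real output. Cells containing spaces would still be split into several cells.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TableTextParser {

    // Splits raw table text into rows (one per line) and cells (whitespace-separated).
    // Assumption: each table row arrives on its own line.
    public static List<List<String>> parseRows(String tableText) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : tableText.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            rows.add(Arrays.asList(trimmed.split("\\s+")));
        }
        return rows;
    }

    // Joins each row's cells with tabs and ends every row with a newline.
    public static String toTsv(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        for (List<String> row : rows) {
            sb.append(String.join("\t", row)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical placeholder text, not real data from the site.
        String sample = "SNo Market Price\n1 MarketA 1200\n\n2 MarketB 1150\n";
        System.out.print(toTsv(parseRows(sample)));
    }
}
```

In the scraper you would pass `htmlTableText` to `parseRows` in place of the sample string. Keeping the rows as lists lets you pick out individual columns later without re-parsing the text.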