Question

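I am trying to download all the PDF files from the web page https://iaeme.com/ijmet/index.asp.

The page has several links, and each link leads to multiple downloads and to further pages. I am trying to navigate to the next page and continue the loop.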

package flow;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.NoSuchElementException;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.apache.tools.ant.taskdefs.Java;
import org.apache.tools.ant.types.FileList.FileName;
import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebDriver.Navigation;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.w3c.dom.Text;

import jxl.common.Assert;
//kindly ignore the imports 


public class excel {

    public static void main(String[] args) throws IOException, Exception {

        System.setProperty("webdriver.chrome.driver", "C:\\Users\\User_2\\Downloads\\chromedriver_win32\\chromedriver.exe");
        WebDriver d=new ChromeDriver();
        d.manage().window().maximize();
        d.get("https://iaeme.com/ijmet/index.asp");                  
        java.util.List<WebElement> catvalues=d.findElements(By.className("issue"));
        for(int i=0;i<=catvalues.size();i++){  
            catvalues.get(i).click();                    
            java.util.List<WebElement> downcount=d.findElements(By.linkText("Download"));
            System.out.println(downcount.size());

            for(int k=1;k<=downcount.size();k++){  
                downcount.get(k).click();                                                
                Thread.sleep(5000);                          
            }

            d.navigate().back();
            catvalues = d.findElements(By.className("issue"));
        }
    }  
}

I have tried different approaches without success.

Answer 1

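If you inspect the page https://iaeme.com/ijmet/index.asp, you will notice that every element with class lik has an onclick attribute. That attribute holds the information you need to open all the pages of interest.

Example:

The pattern is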

onclick="journalpissue('8','9','IJMET')"

From this you have to build this link

https://iaeme.com/ijmet/issues.asp?JType=IJMET&VType=8&IType=9

So, in this example:

VType = 8 IType = 9 JType = IJMET
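
The mapping from the onclick value to the issue URL can be sketched on its own, without a browser. This is a minimal illustration that assumes every onclick value follows the journalpissue('V','I','J') pattern shown above; the class name IssueUrlBuilder and the method buildIssueUrl are hypothetical names chosen for this example, and it uses a regular expression where the full code further down uses a StringTokenizer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IssueUrlBuilder {
    // Captures the three quoted arguments of journalpissue('8','9','IJMET').
    private static final Pattern ONCLICK = Pattern.compile(
            "journalpissue\\(\\s*'([^']*)'\\s*,\\s*'([^']*)'\\s*,\\s*'([^']*)'\\s*\\)");

    // Builds the absolute issue URL from the raw onclick attribute value.
    static String buildIssueUrl(String onclick) {
        Matcher m = ONCLICK.matcher(onclick);
        if (!m.find()) {
            throw new IllegalArgumentException("Unexpected onclick value: " + onclick);
        }
        String vType = m.group(1);
        String iType = m.group(2);
        String jType = m.group(3);
        return "https://iaeme.com/ijmet/issues.asp?JType=" + jType
                + "&VType=" + vType + "&IType=" + iType;
    }

    public static void main(String[] args) {
        // Prints https://iaeme.com/ijmet/issues.asp?JType=IJMET&VType=8&IType=9
        System.out.println(buildIssueUrl("journalpissue('8','9','IJMET')"));
    }
}
```

Throwing on an unexpected value makes a change in the page markup visible immediately instead of silently producing a malformed URL.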

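Once you have all the links, you can iterate over all the pages.

For each page, you have to get the value of the "href" attribute of every element with class jounl.

Once I have the PDF links, I download them with the "curl" command. If you want to download all the files with Selenium instead, see this answer: https://stackoverflow.com/a/37664671/3881320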

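import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.logging.Level;
import java.util.logging.Logger;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

public class Stackoverflow {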

    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        driver.get("https://iaeme.com/ijmet/index.asp");
        java.util.List<WebElement> likValues = driver.findElements(By.className("lik"));
        LinkedList<String> allUrl = new LinkedList<>();
        String baseUrl = "https://iaeme.com/ijmet/";
        for (WebElement el : likValues) {
            String journalpissue = el.getAttribute("onclick");
            String relativeUrl = parseJournalpissue(journalpissue);
            allUrl.add(relativeUrl);
        }

        for (String url : allUrl) {
            analyzePage(driver, baseUrl + url, true);
        }

    }

    private static void analyzePage(WebDriver driver, String url, boolean searchOtherPages) {
        driver.get(url);
        // Copy the pagination URLs into plain strings right away: the WebElements
        // become stale as soon as the driver navigates to another page.
        List<String> otherPages = new LinkedList<>();
        if (searchOtherPages) {
            List<WebElement> tdlist = driver.findElements(By.cssSelector("table[class='contant'] tr td"));
            if (!tdlist.isEmpty()) {
                // The last cell of the table holds the links to the other pages.
                WebElement pages = tdlist.get(tdlist.size() - 1);
                System.out.println(pages.getText());
                for (WebElement a : pages.findElements(By.tagName("a"))) {
                    String href = a.getAttribute("href");
                    if (href != null) {
                        otherPages.add(href);
                    }
                }
            }
        }

        List<WebElement> jounl = driver.findElements(By.className("jounl"));
        for (WebElement wel : jounl) {
            String href = wel.getAttribute("href");
            // getAttribute returns null when the element has no href.
            if (href != null && href.contains(".pdf")) {
                System.out.println("File to download: " + href);
                downloadFile(href);
            }
        }

        for (String pageUrl : otherPages) {
            System.out.println(pageUrl);
            analyzePage(driver, pageUrl, false);
        }
    }


    private static void downloadFile(String file) {
        try {
            // Pass the URL as its own argument instead of going through "bash -c",
            // so characters such as '&' in the URL are not interpreted by a shell.
            // This still assumes a curl executable is available on the PATH.
            String[] command = {"curl", "-O", file};
            String output;

            Process p = Runtime.getRuntime().exec(command);
            StringBuilder outputBuilder = new StringBuilder();
            // Reading stdout until end of stream keeps this method busy while curl runs.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    outputBuilder.append(line).append("\n");
                }
            }
            output = outputBuilder.toString();
        } catch (IOException ex) {
            Logger.getLogger(Stackoverflow.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    private static String parseJournalpissue(String journalpissue) {
        String finalUrl = null;

        StringTokenizer st = new StringTokenizer(journalpissue, "'");
        st.nextToken();
        String vType = st.nextToken();
        st.nextToken();
        String iType = st.nextToken();
        st.nextToken();
        String jType = st.nextToken();

        finalUrl = "issues.asp?JType=" + jType + "&VType=" + vType + "&IType=" + iType;
        System.out.println(finalUrl);
        return finalUrl;

    }
}
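
The downloadFile method above depends on an external curl executable. As an alternative, here is a minimal sketch that fetches a file with the standard library only. It is not part of the original answer: the class name PdfDownloader, its methods, and the default folder name "pdfs" are hypothetical, and it assumes the PDF URL can be fetched directly, without cookies or a login.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

class PdfDownloader {
    // Derives a local file name from the last path segment of the URL.
    static String fileNameOf(URL url) {
        String path = url.getPath();
        String name = path.substring(path.lastIndexOf('/') + 1);
        return name.isEmpty() ? "download.pdf" : name;
    }

    // Streams the resource at 'url' into 'targetDir' and returns the written file.
    static Path download(URL url, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(fileNameOf(url));
        try (InputStream in = url.openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java PdfDownloader <pdf-url> [target-directory]
        if (args.length == 0) {
            System.err.println("usage: PdfDownloader <pdf-url> [target-directory]");
            return;
        }
        Path dir = Paths.get(args.length > 1 ? args[1] : "pdfs");
        System.out.println("Saved " + download(new URL(args[0]), dir));
    }
}
```

With something like this in place, the body of downloadFile could reduce to a call such as PdfDownloader.download(new URL(file), Paths.get("pdfs")), which also removes the dependency on a Unix shell.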

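Note: I have not considered that one of the pages (the ones with the PDF files to download) may itself have more pages (the "more pages" in your description). You can use the same approach for that.

Edit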

The information about the number of pages is in:

the table with class name "contant". Specifically, it is in the last element.

So:

List<WebElement> tdlist = driver.findElements(By.cssSelector("table[class='contant'] tr td"));
WebElement pages = tdlist.get(tdlist.size() - 1);

We are interested in the elements with tagName "a":

List<WebElement> allA = pages.findElements(By.tagName("a"));

Now we also have the URLs of all the other pages. We can use the same approach as before to download the PDF files.
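
One way to keep such a loop from failing while the browser moves between pages is to work with URL strings rather than WebElement references, because element references stop being valid once the driver loads another page. The sketch below shows only that bookkeeping, with no Selenium dependency. The class PageWalker and the linksOf function are hypothetical; in real use linksOf would load the page with the driver, download its PDFs, and return the "href" strings of the pagination links.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class PageWalker {
    // Visits 'start' and every page reachable through 'linksOf', each exactly once,
    // and returns the URLs in the order they were visited.
    static List<String> walk(String start, Function<String, List<String>> linksOf) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            for (String next : linksOf.apply(url)) {
                // seen.add returns false for URLs that are already scheduled.
                if (next != null && seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Stand-in for the site: each page lists the other pages it links to.
        Map<String, List<String>> site = new HashMap<>();
        site.put("p1", Arrays.asList("p2", "p3"));
        site.put("p2", Arrays.asList("p1", "p3"));
        List<String> order = walk("p1",
                url -> site.getOrDefault(url, Collections.<String>emptyList()));
        System.out.println(order); // prints [p1, p2, p3]
    }
}
```

Tracking visited URLs in a set also covers the case where every page links back to all the others, which would otherwise make the crawl revisit pages.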

Navigate to the next page in a for loop without making the loop fail?
