JSoup / Selenium WebDriver missing content

Time: 2016-10-27 13:02:17

Tags: html selenium web-scraping jsoup

I'm trying to get information from this page using JSoup or Selenium WebDriver. This is my Selenium implementation:

package reddit;

import java.util.logging.Level;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class Reddit {

public static void main(String[] args) {
    java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF); 
    WebDriver driver = new HtmlUnitDriver();
    while(true){
        driver.get("https://www.reddit.com/r/RocketLeagueExchange/new/");
        WebElement post = driver.findElement(By.cssSelector("p.title"));
        String platform = post.findElement(By.className("linkflairlabel")).getText();
        System.out.println("Platform:"+platform);
    }
}

}

The information I'm trying to get from the page is:

<span class="linkflairlabel" title="STEAM">STEAM</span>

It is the second child of:

<p class="title"><a class="title may-blank loggedin srTagged" data-event-action="title" href="/r/RocketLeagueExchange/comments/59nosd/pc_h_list_of_items_w_crates/" tabindex="1" rel="nofollow">[PC] [H] List of Items [W] Crates</a><span class="linkflairlabel" title="STEAM">STEAM</span> <span class="domain">(<a href="/r/RocketLeagueExchange/">self.RocketLeagueExchange</a>)</span></p>

The problem is that it doesn't get the text. I also tried using cssSelector(), but after a while it gave me this error:

Exception in thread "main" org.openqa.selenium.NoSuchElementException: Unable to find an element with xpath .//*[contains(concat(' ',normalize-space(@class),' '),' linkflairlabel ')]
Driver info: driver.version: HtmlUnitDriver
at org.openqa.selenium.htmlunit.HtmlUnitWebElement.findElementByXPath(HtmlUnitWebElement.java:725)
at org.openqa.selenium.By$ByClassName.findElement(By.java:392)
at org.openqa.selenium.htmlunit.HtmlUnitDriver$5.call(HtmlUnitDriver.java:1725)
at org.openqa.selenium.htmlunit.HtmlUnitDriver$5.call(HtmlUnitDriver.java:1)
at org.openqa.selenium.htmlunit.HtmlUnitDriver.implicitlyWaitFor(HtmlUnitDriver.java:1367)
at org.openqa.selenium.htmlunit.HtmlUnitDriver.findElement(HtmlUnitDriver.java:1721)
at org.openqa.selenium.htmlunit.HtmlUnitWebElement.findElement(HtmlUnitWebElement.java:655)
at reddit.Reddit.main(Reddit.java:21)

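JSoup sometimes does the job and sometimes doesn't.

I know this is a noob question, but this is my first attempt. What am I missing?

EDIT: When I run the program, it logs these warnings: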

ott 27, 2016 3:08:36 PM com.gargoylesoftware.htmlunit.html.InputElementFactory createElementNS
INFORMAZIONI: Bad input type: "email", creating a text input
ott 27, 2016 3:08:37 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
AVVERTENZA: CSS error: 'https://www.redditstatic.com/reddit.YBQ3OGUgns4.css' [1:95984] Error in expression. (Invalid token " ". Was expecting one of: <NUMBER>, "inherit", <IDENT>, <STRING>, <HASH>, <EMS>, <EXS>, <LENGTH_PX>, <LENGTH_CM>, <LENGTH_MM>, <LENGTH_IN>, <LENGTH_PT>, <LENGTH_PC>, <ANGLE_DEG>, <ANGLE_RAD>, <ANGLE_GRAD>, <TIME_MS>, <TIME_S>, <FREQ_HZ>, <FREQ_KHZ>, <RESOLUTION_DPI>, <RESOLUTION_DPCM>, <PERCENTAGE>, <DIMENSION>, <URI>, <FUNCTION>.)
ott 27, 2016 3:08:38 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
AVVERTENZA: CSS error: 'https://www.reddit.com/r/RocketLeagueExchange/new/' [1:1] Error in style sheet. (Invalid token "<". Was expecting one of: <EOF>, <S>, <IDENT>, "<!--", "-->", ".", ":", "*", "[", <HASH>, <IMPORT_SYM>, <PAGE_SYM>, <MEDIA_SYM>, <FONT_FACE_SYM>, <CHARSET_SYM>, <ATKEYWORD>.)
ott 27, 2016 3:08:38 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
AVVERTENZA: CSS error: 'https://b.thumbs.redditmedia.com/iInvQHcXVppWeQbAiLLVmDIZeWaC3nY_GVyunvQi0Hw.css' [1:57883] Error in simple selector. (Invalid token "{". Was expecting one of: <S>, <IDENT>, ".", ":", "*", "[", <HASH>.)
ott 27, 2016 3:08:38 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning
AVVERTENZA: CSS warning: 'https://b.thumbs.redditmedia.com/iInvQHcXVppWeQbAiLLVmDIZeWaC3nY_GVyunvQi0Hw.css' [1:57883] Ignoring the whole rule.

EDIT2: I dumped the page's HTML source, and it does contain the class/text I want. So Selenium / JSoup is missing it for some reason.

EDIT3:

Using:

(new WebDriverWait(driver, 30)).until(new ExpectedCondition<Boolean>() {
                    public Boolean apply(WebDriver d) {
                        return d.findElement(By.cssSelector(locator)).getText().length() != 0;
                    }
                });

doesn't fix it either.

1 Answer:

Answer 0 (score: 0)

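Without digging too deeply into your problem, I can see one issue: WebElement post = driver.findElement(By.cssSelector("p.title")); returns only the first element that matches the selector, not all of them. If that first element has no inner element with the class linkflairlabel, the nested lookup throws NoSuchElementException and your code will not work.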
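The failure mode can be reproduced without a browser. The sketch below is only an illustration under stated assumptions: it uses the JDK's javax.xml XPath API as a stand-in for Selenium, SAMPLE is a hypothetical, simplified listing (not reddit's real markup), and the class name FirstMatchDemo and the method flairOfFirstTitle are made up for this example. The flair XPath is the one shown in the stack trace from the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class FirstMatchDemo {
    // Hypothetical listing: the first post has no flair, the second one does.
    static final String SAMPLE = "<div>"
        + "<p class=\"title\"><a>Post without flair</a></p>"
        + "<p class=\"title\"><a>[PC] [H] List of Items [W] Crates</a>"
        + "<span class=\"linkflairlabel\">STEAM</span></p>"
        + "</div>";

    // The class test HtmlUnitDriver generated for By.className("linkflairlabel").
    static final String FLAIR =
        ".//*[contains(concat(' ',normalize-space(@class),' '),' linkflairlabel ')]";

    /** Returns the flair text under the FIRST p.title, or null if that element has none. */
    static String flairOfFirstTitle(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Analogue of driver.findElement(By.cssSelector("p.title")): first match only.
            // Simplification: @class='title' is an exact match, unlike a CSS class selector.
            Node firstTitle = (Node) xpath.evaluate(
                "(//p[@class='title'])[1]", doc, XPathConstants.NODE);
            // Analogue of post.findElement(By.className("linkflairlabel")).
            Node flair = (Node) xpath.evaluate(FLAIR, firstTitle, XPathConstants.NODE);
            return flair == null ? null : flair.getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The first p.title has no flair, so the nested lookup finds nothing.
        System.out.println(flairOfFirstTitle(SAMPLE)); // prints "null"
    }
}
```

Where this sketch returns null, Selenium's findElement throws NoSuchElementException instead, which matches the stack trace in the question.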

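I see two solutions:

1) Return all elements that match the CSS selector p.title and loop over them to find the ones you are interested in. Use List<WebElement> posts = driver.findElements(By.cssSelector("p.title"));

2) Use a more specific CSS selector, like this: List<WebElement> linkFlairPosts = driver.findElements(By.cssSelector("p.title span.linkflairlabel")); This fetches the relevant elements in one go.

I prefer approach 2 because it is more concise.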
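To compare the two approaches side by side, here is a minimal sketch that again assumes the JDK's javax.xml XPath API as a browser-free stand-in for Selenium. SAMPLE is hypothetical markup, and FlairLookupDemo, flairsByLooping and flairsBySingleQuery are names invented for this illustration. The XPath expressions only approximate the CSS selectors, because they match the class attribute exactly.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class FlairLookupDemo {
    // Hypothetical listing: one post without flair, followed by two flaired posts.
    static final String SAMPLE = "<div>"
        + "<p class=\"title\"><a>Post without flair</a></p>"
        + "<p class=\"title\"><a>First flaired post</a>"
        + "<span class=\"linkflairlabel\">STEAM</span></p>"
        + "<p class=\"title\"><a>Second flaired post</a>"
        + "<span class=\"linkflairlabel\">PS4</span></p>"
        + "</div>";

    static final XPath XPATH = XPathFactory.newInstance().newXPath();

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Evaluates expr relative to context and returns every matching node. */
    static NodeList select(String expr, Object context) {
        try {
            return (NodeList) XPATH.evaluate(expr, context, XPathConstants.NODESET);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Approach 1: fetch every p.title, then look inside each one for a flair. */
    static List<String> flairsByLooping(String xml) {
        List<String> flairs = new ArrayList<>();
        NodeList titles = select("//p[@class='title']", parse(xml));
        for (int i = 0; i < titles.getLength(); i++) {
            NodeList inner = select(".//span[@class='linkflairlabel']", titles.item(i));
            if (inner.getLength() > 0) { // skip posts that have no flair
                flairs.add(inner.item(0).getTextContent());
            }
        }
        return flairs;
    }

    /** Approach 2: one more specific query, the analogue of "p.title span.linkflairlabel". */
    static List<String> flairsBySingleQuery(String xml) {
        List<String> flairs = new ArrayList<>();
        NodeList spans = select("//p[@class='title']//span[@class='linkflairlabel']", parse(xml));
        for (int i = 0; i < spans.getLength(); i++) {
            flairs.add(spans.item(i).getTextContent());
        }
        return flairs;
    }

    public static void main(String[] args) {
        System.out.println(flairsByLooping(SAMPLE));     // prints [STEAM, PS4]
        System.out.println(flairsBySingleQuery(SAMPLE)); // prints [STEAM, PS4]
    }
}
```

Both methods skip the unflaired post. In Selenium, findElements returns an empty list rather than throwing when nothing matches, so either approach also avoids the NoSuchElementException from the question.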