I am trying to get information from this page using either JSoup or Selenium WebDriver. Here is my Selenium implementation:
package reddit;

import java.util.logging.Level;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class Reddit {
    public static void main(String[] args) {
        java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
        WebDriver driver = new HtmlUnitDriver();
        while (true) {
            driver.get("https://www.reddit.com/r/RocketLeagueExchange/new/");
            WebElement post = driver.findElement(By.cssSelector("p.title"));
            String platform = post.findElement(By.className("linkflairlabel")).getText();
            System.out.println("Platform:" + platform);
        }
    }
}
The information I am trying to get from the page is:
<span class="linkflairlabel" title="STEAM">STEAM</span>
which is the second child of:
<p class="title"><a class="title may-blank loggedin srTagged" data-event-action="title" href="/r/RocketLeagueExchange/comments/59nosd/pc_h_list_of_items_w_crates/" tabindex="1" rel="nofollow">[PC] [H] List of Items [W] Crates</a><span class="linkflairlabel" title="STEAM">STEAM</span> <span class="domain">(<a href="/r/RocketLeagueExchange/">self.RocketLeagueExchange</a>)</span></p>
The problem is that it does not get the text. I also tried using cssSelector(), but after a while it gives me this error:
Exception in thread "main" org.openqa.selenium.NoSuchElementException: Unable to find an element with xpath .//*[contains(concat(' ',normalize-space(@class),' '),' linkflairlabel ')]
Driver info: driver.version: HtmlUnitDriver
at org.openqa.selenium.htmlunit.HtmlUnitWebElement.findElementByXPath(HtmlUnitWebElement.java:725)
at org.openqa.selenium.By$ByClassName.findElement(By.java:392)
at org.openqa.selenium.htmlunit.HtmlUnitDriver$5.call(HtmlUnitDriver.java:1725)
at org.openqa.selenium.htmlunit.HtmlUnitDriver$5.call(HtmlUnitDriver.java:1)
at org.openqa.selenium.htmlunit.HtmlUnitDriver.implicitlyWaitFor(HtmlUnitDriver.java:1367)
at org.openqa.selenium.htmlunit.HtmlUnitDriver.findElement(HtmlUnitDriver.java:1721)
at org.openqa.selenium.htmlunit.HtmlUnitWebElement.findElement(HtmlUnitWebElement.java:655)
at reddit.Reddit.main(Reddit.java:21)
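JSoup sometimes does the job and sometimes does not.
I know this is a noob question, but this is my first attempt. What am I missing?
Edit: When I run the program, it prints these warnings: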
ott 27, 2016 3:08:36 PM com.gargoylesoftware.htmlunit.html.InputElementFactory createElementNS
INFORMAZIONI: Bad input type: "email", creating a text input
ott 27, 2016 3:08:37 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
AVVERTENZA: CSS error: 'https://www.redditstatic.com/reddit.YBQ3OGUgns4.css' [1:95984] Error in expression. (Invalid token " ". Was expecting one of: <NUMBER>, "inherit", <IDENT>, <STRING>, <HASH>, <EMS>, <EXS>, <LENGTH_PX>, <LENGTH_CM>, <LENGTH_MM>, <LENGTH_IN>, <LENGTH_PT>, <LENGTH_PC>, <ANGLE_DEG>, <ANGLE_RAD>, <ANGLE_GRAD>, <TIME_MS>, <TIME_S>, <FREQ_HZ>, <FREQ_KHZ>, <RESOLUTION_DPI>, <RESOLUTION_DPCM>, <PERCENTAGE>, <DIMENSION>, <URI>, <FUNCTION>.)
ott 27, 2016 3:08:38 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
AVVERTENZA: CSS error: 'https://www.reddit.com/r/RocketLeagueExchange/new/' [1:1] Error in style sheet. (Invalid token "<". Was expecting one of: <EOF>, <S>, <IDENT>, "<!--", "-->", ".", ":", "*", "[", <HASH>, <IMPORT_SYM>, <PAGE_SYM>, <MEDIA_SYM>, <FONT_FACE_SYM>, <CHARSET_SYM>, <ATKEYWORD>.)
ott 27, 2016 3:08:38 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
AVVERTENZA: CSS error: 'https://b.thumbs.redditmedia.com/iInvQHcXVppWeQbAiLLVmDIZeWaC3nY_GVyunvQi0Hw.css' [1:57883] Error in simple selector. (Invalid token "{". Was expecting one of: <S>, <IDENT>, ".", ":", "*", "[", <HASH>.)
ott 27, 2016 3:08:38 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning
AVVERTENZA: CSS warning: 'https://b.thumbs.redditmedia.com/iInvQHcXVppWeQbAiLLVmDIZeWaC3nY_GVyunvQi0Hw.css' [1:57883] Ignoring the whole rule.
EDIT2: I dumped the page's HTML source, and it does contain the class and text I want. So Selenium/JSoup is missing it for some reason.
EDIT3:
Using:
(new WebDriverWait(driver, 30)).until(new ExpectedCondition<Boolean>() {
    public Boolean apply(WebDriver d) {
        return d.findElement(By.cssSelector(locator)).getText().length() != 0;
    }
});
does not fix it either.
Answer 0 (score: 0)
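Without digging too deeply into your problem, I can see one issue: you use WebElement post = driver.findElement(By.cssSelector("p.title")); which returns only the first element that matches the selector, not all of them. If that first element does not contain an inner element with the class linkflairlabel, your code will not work.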
I see two solutions:
1) Return all elements that match the CSS selector p.title and loop over them to find the ones you are interested in, using List<WebElement> posts = driver.findElements(By.cssSelector("p.title"));
2) Use a more specific CSS selector, such as List<WebElement> linkFlairPosts = driver.findElements(By.cssSelector("p.title span.linkflairlabel")); which fetches the relevant elements in one go.
I prefer approach 2 because it is more concise.
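To make the difference concrete, here is a minimal sketch that swaps Selenium for the JDK's built-in XPath engine, so it runs without a browser or network access. The markup in SAMPLE is hand-written for illustration (only the STEAM post is taken from the question, and the flair-less first post is an assumption about what the real page may contain), and FlairDemo and its method names are hypothetical. The class test has the same shape as the XPath in the stack trace.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FlairDemo {
    // Hypothetical listing: the first post deliberately has no flair span.
    static final String SAMPLE =
        "<div>"
        + "<p class=\"title\"><a href=\"/a\">Post without flair</a></p>"
        + "<p class=\"title\"><a href=\"/b\">[PC] [H] List of Items [W] Crates</a>"
        + "<span class=\"linkflairlabel\" title=\"STEAM\">STEAM</span></p>"
        + "<p class=\"title\"><a href=\"/c\">Another post</a>"
        + "<span class=\"linkflairlabel\" title=\"PS4\">PS4</span></p>"
        + "</div>";

    // XPath predicate for "the class attribute contains this token".
    static String hasClass(String name) {
        return "contains(concat(' ',normalize-space(@class),' '),' " + name + " ')";
    }

    static final String TITLE = ".//p[" + hasClass("title") + "]";
    static final String FLAIR = ".//span[" + hasClass("linkflairlabel") + "]";

    // Parse the well-formed sample markup into a DOM tree.
    static Node parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("sample markup is not well-formed", e);
        }
    }

    // Evaluate an XPath expression relative to the given node.
    static List<Node> select(Node context, String expr) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, context, XPathConstants.NODESET);
            List<Node> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i));
            }
            return out;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    // The question's logic: inspect only the first p.title; null if it has no flair.
    static String flairOfFirstPost(Node page) {
        Node first = select(page, TITLE).get(0);
        List<Node> flairs = select(first, FLAIR);
        return flairs.isEmpty() ? null : flairs.get(0).getTextContent();
    }

    // Approach 1: visit every p.title and collect the flairs that exist.
    static List<String> flairsByLooping(Node page) {
        List<String> out = new ArrayList<>();
        for (Node post : select(page, TITLE)) {
            for (Node flair : select(post, FLAIR)) {
                out.add(flair.getTextContent());
            }
        }
        return out;
    }

    // Approach 2: one combined query, the analogue of "p.title span.linkflairlabel".
    static List<String> flairsBySelector(Node page) {
        String combined = ".//p[" + hasClass("title") + "]//span[" + hasClass("linkflairlabel") + "]";
        List<String> out = new ArrayList<>();
        for (Node flair : select(page, combined)) {
            out.add(flair.getTextContent());
        }
        return out;
    }

    public static void main(String[] args) {
        Node page = parse(SAMPLE);
        System.out.println("First post only: " + flairOfFirstPost(page));
        System.out.println("Approach 1: " + flairsByLooping(page));
        System.out.println("Approach 2: " + flairsBySelector(page));
    }
}
```

On this sample the first-post lookup finds nothing, while both approaches return the STEAM and PS4 flairs. In Selenium itself, note that findElements returns an empty list when nothing matches, whereas findElement throws NoSuchElementException, so either approach also avoids the exception from the question.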