Question

I am trying to collect URLs from a web page with RSelenium, but I get an InvalidSelector error.

I am using R 3.6.0 on a Windows 10 PC with RSelenium 1.7.5 and the Chrome WebDriver (chromever = "75.0.3770.8").


library(RSelenium)

rD <- rsDriver(browser=c("chrome"), chromever="75.0.3770.8")
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()

url <- "https://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results?q=&mem=1&par=1&gen=0&ps=96"
remDr$navigate(url)

tt <- remDr$findElements(using = "xpath", "//a[contains(@href,'http://twitter.com/')]/@href")

I expected to collect the URLs that point to the listed politicians' Twitter accounts. Instead, I get the following error:

Selenium message:

invalid selector: The result of the xpath expression "//a[contains(@href,'http://twitter.com/')]/@href" is: [object Attr]. It should be an element.
  (Session info: chrome=75.0.3770.80)
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/invalid_selector_exception.html
Build info: version: '4.0.0-alpha-1', revision: 'd1d3728cae', time: '2019-04-24T16:15:24'
System info: host: 'ALEX-DELL-17', ip: '10.0.75.1', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_191'
Driver info: driver.version: unknown

Error: Summary: InvalidSelector
Detail: Argument was an invalid selector (e.g. XPath/CSS).
class: org.openqa.selenium.InvalidSelectorException
Further Details: run errorDetails method

When I run a similar search for one very specific element, everything works, for example:

tt <- remDr$findElement(value = '//a[@href = "http://twitter.com/AlboMP"]')

and then

tt$getElementAttribute('href')

returns the URL I need.

What am I doing wrong?

Answer 1

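This error message...

invalid selector: The result of the xpath expression "//a[contains(@href,'http://twitter.com/')]/@href" is: [object Attr]. It should be an element.

...means that your XPath expression cannot be used as an element selector.

The XPath expression:

//a[contains(@href,'http://twitter.com/')]/@href

does not return an element; it returns an [object Attr], that is, an attribute node. That was acceptable with Selenium RC, but the methods of WebDriver's WebElement interface require an element object, not just any DOM node object.

In short, Selenium still does not support this form. With a text node, the workaround is to change the HTML markup so that the text node is wrapped in an element; with an attribute node, as here, you select the element itself and read the attribute from it afterwards.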

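Solution

To solve this, select the <a> elements themselves with findElements, which returns a list:

remDr$findElements(using = "xpath", value = "//a[contains(@href, 'twitter.com/')]")

You can then iterate over the list and call getElementAttribute('href') on each element to extract the URL.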
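A minimal sketch of that iteration follows. The helper name `collect_hrefs` is hypothetical, `remDr` is assumed to be the open session from the question, and the XPath predicate `contains(@href, 'twitter.com/')` is an assumption chosen so that `https://` and `www.` variants also match. As far as I know, `getElementAttribute()` returns a list in RSelenium, which is why the results are flattened with `unlist()`.

```r
# Hypothetical helper: read the href attribute of every element in a list
# such as the one returned by remDr$findElements().
collect_hrefs <- function(elements) {
  # Each call is expected to return a list; unlist() flattens the results
  # into a character vector.
  hrefs <- unlist(lapply(elements, function(el) el$getElementAttribute("href")))
  # An empty input yields NULL from unlist(), so return an empty vector;
  # otherwise drop duplicate URLs.
  if (is.null(hrefs)) character(0) else unique(hrefs)
}

# Usage with the session from the question (needs a running browser):
# elems <- remDr$findElements(using = "xpath",
#                             value = "//a[contains(@href, 'twitter.com/')]")
# twitter_urls <- collect_hrefs(elems)
```

The XPath now ends at the `<a>` element rather than at `/@href`, so Selenium receives elements and the attribute is read on the R side.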

Reference

InvalidSelectorError: The result of the xpath expression is: [object Text]

Answer 2

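I know nothing about R, so I first posted this answer in Python. Since the question is about R, I then learned some R basics and added an R version as well.

The simplest way to get the Twitter URLs is to loop over all the URLs on the page and check whether each one contains "twitter".

In Python (this definitely works):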

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results?q=&mem=1&par=1&gen=0&ps=96')
links = driver.find_elements(By.XPATH, "//a[@href]")  # every <a> that has an href
for link in links:
    href = link.get_attribute("href")
    if 'twitter' in href:
        print(href)

Result:

http://twitter.com/AlboMP
http://twitter.com/SharonBirdMP
http://twitter.com/Bowenchris
http://twitter.com/tony_burke
http://twitter.com/lindaburneymp
http://twitter.com/Mark_Butler_MP
https://twitter.com/terrimbutler
http://twitter.com/AnthonyByrne_MP
https://twitter.com/JEChalmers
http://twitter.com/NickChampionMP
https://twitter.com/LMChesters
http://twitter.com/JasonClareMP
https://twitter.com/SharonClaydon
https://www.twitter.com/LibbyCokerMP
https://twitter.com/JulieCollinsMP
http://twitter.com/fitzhunter
http://twitter.com/stevegeorganas
https://twitter.com/andrewjgiles
https://twitter.com/lukejgosling
https://www.twitter.com/JulianHillMP
http://twitter.com/stephenjonesalp
https://twitter.com/gedkearney
https://twitter.com/MikeKellyofEM
http://twitter.com/mattkeogh
http://twitter.com/PeterKhalilMP
http://twitter.com/CatherineKingMP
https://twitter.com/MadeleineMHKing
https://twitter.com/ALEIGHMP
https://twitter.com/RichardMarlesMP
https://twitter.com/brianmitchellmp
http://twitter.com/#!/RobMitchellMP
http://twitter.com/ShayneNeumannMP
https://twitter.com/ClareONeilMP
http://twitter.com/JulieOwensMP
http://www.twitter.com/GrahamPerrettMP
http://twitter.com/tanya_plibersek
http://twitter.com/AmandaRishworth
http://twitter.com/MRowlandMP
https://twitter.com/JoanneRyanLalor
http://twitter.com/billshortenmp
http://www.twitter.com/annewerriwa
http://www.twitter.com/stemplemanmp
https://twitter.com/MThistlethwaite
http://twitter.com/MariaVamvakinou
https://twitter.com/TimWattsMP
https://twitter.com/joshwilsonmp

In R (this may be wrong, but it should give you the idea):

library(XML)
library(RCurl)
library(RSelenium)
url <- "https://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results?q=&mem=1&par=1&gen=0&ps=96"
doc <- getURL(url)
parser <- htmlParse(doc)
links <- xpathSApply(parser, "//a[@href]", xmlGetAttr, "href")
for(link in links){
    if(grepl("twitter", link)){
        print(link)
    }
}

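I don't even know whether this code works. The idea, though, is to get all the URLs on the page, iterate over them, and check whether each one contains "twitter". This answer is based on a linked post.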
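The idea described above (collect every href, then keep the ones that point to Twitter) can also be written in R without an explicit loop. The following is only a sketch on a small sample vector: the first three URLs are taken from the Python output in this answer, the last one is an illustrative non-Twitter link, and the host-anchored regular expression is an assumption, not part of the original answer.

```r
# Sample hrefs standing in for the links scraped from the page.
links <- c(
  "http://twitter.com/AlboMP",
  "https://www.twitter.com/LibbyCokerMP",
  "http://twitter.com/#!/RobMitchellMP",
  "https://www.aph.gov.au/Senators_and_Members"
)

# TRUE for links whose host is twitter.com, with or without "www.",
# over http or https.
is_twitter <- grepl("^https?://(www\\.)?twitter\\.com/", links, ignore.case = TRUE)

# Logical indexing keeps only the matching links.
twitter_links <- links[is_twitter]
print(twitter_links)
```

Anchoring the pattern on the host avoids matching unrelated URLs that merely contain the word "twitter" somewhere in their path or query string.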

Get all Twitter links on a web page using RSelenium
