RSelenium scraping for Disqus comments

Time: 2016-07-11 20:16:26

Tags: html css web-scraping webdriver rselenium

I'm trying to scrape the text of Disqus comments from a local online newspaper using RSelenium in Chrome, but it is proving a little beyond my capabilities. I have searched in many places but either did not find the right information or, more probably, was using the wrong search terms.

So far I have managed to get the "normal" html from the pages but cannot pinpoint the right class, css selector or id to get the Disqus comments. I have also tried Selectorgadget, but it only points to #dsq-app2, which selects the whole Disqus area at once and does not allow selecting smaller parts of it. I tried the same with RSelenium using elems <- mybrowser$findElement(using = "id", "dsq-app2"), which stores an "environment" in elems. I then tried to find child elements within elems but came up blank.

Viewing the page in developer tools, I can see that the interesting content sits inside an iframe with the id dsq-app2, and I have managed to extract all of its source through elems$getPageSource() after switching to the frame using elems$switchToFrame("dsq-app2"). This outputs all the html as one big "dirty" chunk. Short of searching that chunk for the required content held in <p> tags and for other elements of interest, such as posters' usernames in data-role="username", I can't seem to find the right way forward.

I have also tried using the advice given here, but the Disqus setup is a little different. One of the pages I'm trying is this, with the bulk of the comments area within a section called conversation and a ton of other ids such as posts, plus the unordered list with id=post-list that ultimately carries the comments I need to scrape.

Any ideas or tips are most welcome and will be received with thanks.

1 answer:

Answer 0 (score: 1)

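After a lot of testing and experimenting, I managed it. I don't know whether it is the cleanest or prettiest solution, but it works. Hopefully others will find it useful. Basically, what I did was find the URL that points to the comments only. It can be found on the "dsq-app2" iframe, in the attribute called src. At first I was also switching to the iframe, but found that this was not needed.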

remDr$navigate("toTheRequiredPage")
elemsource <- remDr$findElement(using = "id", value = "dsq-app2")
src <- elemsource$getElementAttribute("src") # read the iframe's src attribute
remDr$navigate(src[[1]]) # navigate to the src url

# find the posts from the new page
elem <- remDr$findElement(using = "id", value = "posts")
elem.posts <- elem$findChildElements(using = "id", value = "post-list")
elem.msgs <- elem.posts[[1]]$findChildElements(using = "class name", value = "post-message")

length(elem.msgs)
msgtext <- elem.msgs[[1]]$getElementText() # find first post's text
msgtext # print message
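
To collect every comment rather than only the first, something along these lines should work. It is a small sketch that assumes elem.msgs is the list of elements returned above; all.msgs is just an illustrative name.

```r
# getElementText() returns a list of length one, hence the [[1]].
# sapply() walks over every post-message element found above.
all.msgs <- sapply(elem.msgs, function(el) el$getElementText()[[1]])

length(all.msgs)  # should match length(elem.msgs)
head(all.msgs)    # inspect the first few comments
```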

Update: I found that if I use remDr$switchToFrame("dsq-app2"), I don't need the src URL as explained above. So there are actually two ways of scraping:

  1. Use switchToFrame("nameOfFrame")
  2. Use my earlier solution with the src URL from the iframe

Hope this is clearer.
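
For reference, the first option might look roughly like the sketch below. It assumes a Selenium server is already running locally on port 4444, and it passes the iframe element itself to switchToFrame() rather than the frame name, which is my own choice and not something tested against the page in question. The names frame, msgs and msgtext are illustrative.

```r
library(RSelenium)

# Assumption: a Selenium server is already listening on localhost:4444.
remDr <- remoteDriver(browserName = "chrome", port = 4444L)
remDr$open()
remDr$navigate("toTheRequiredPage")  # placeholder URL, as in the code above

# Switch into the Disqus iframe instead of navigating to its src URL.
frame <- remDr$findElement(using = "id", value = "dsq-app2")
remDr$switchToFrame(frame)

# Inside the frame, the comments can be located directly.
msgs <- remDr$findElements(using = "class name", value = "post-message")
msgtext <- sapply(msgs, function(el) el$getElementText()[[1]])

# Return to the top-level document when done.
remDr$switchToFrame(NULL)
```

Disqus loads comments asynchronously, so a short pause (for example Sys.sleep()) before calling findElements() may be needed if msgs comes back empty.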