Question

I'm trying to scrape the text of Disqus comments from an online local newspaper using RSelenium with Chrome, but it is proving difficult at my skill level. I have searched in many places but either did not find the right information or, most probably, used the wrong search terms.

So far I have managed to get the "normal" HTML from the pages but cannot pinpoint the right class, CSS selector or id to get the Disqus comments. I have also tried SelectorGadget, but it only points to #dsq-app2, which selects the whole Disqus area at once and does not allow selecting smaller parts of it. I tried the same with RSelenium using elems <- mybrowser$findElement(using = "id", "dsq-app2"), which stores an "environment" in elems. I then tried to find child elements within elems but came up blank.

Viewing the page in the developer tools, I can see that the interesting content is inside an iframe with the id dsq-app2, and I have managed to extract all its source through elems$getPageSource() after switching to the frame using elems$switchToFrame("dsq-app2"). This outputs all the HTML as one big "dirty" chunk. Short of searching that chunk for the required content held in <p> tags and other elements of interest, such as posters' usernames in data-role="username", I can't see the right way forward.

I have also tried using the advice given here, but the Disqus setup is a little different. One of the pages I'm trying is this, with the bulk of the comments area within a section called conversation and many other ids, such as posts and the unordered list with id=post-list that ultimately carries the comments I need to scrape.

Any ideas or tips are most welcome and received with thanks.

Answer 1

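After a lot of testing and experimenting, I managed to do it. I don't know whether it is the cleanest or prettiest solution, but it works. Hopefully others will find it useful. Basically, what I did was find the URL that points solely to the comments. It is found on the "dsq-app2" iframe, in the attribute called src. At first I was also switching to the iframe, but found that this was not needed.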

remDr$navigate("toTheRequiredPage")
elemsource <- remDr$findElement(using = "id", value = "dsq-app2")
src <- elemsource$getElementAttribute("src") # find the src attribute within the iframe
remDr$navigate(src[[1]]) # navigate to the src url

# find the posts from the new page
elem <- remDr$findElement(using = "id", value = "posts")
elem.posts <- elem$findChildElements(using = "id", value = "post-list")
elem.msgs <- elem.posts[[1]]$findChildElements(using = "class name", value = "post-message")

length(elem.msgs)
msgtext <- elem.msgs[[1]]$getElementText() # find first post's text
msgtext # print message
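
The snippet above reads only the first post. As a minimal sketch, assuming elem.msgs is the list of elements found above, the text of every post could be collected into a character vector like this. The helper name collect_post_text is hypothetical and not part of RSelenium; it relies only on each element's getElementText() method, which returns a list whose first entry is the text.

```r
# Hypothetical helper (not an RSelenium function): gather the text of
# every element in a list such as elem.msgs.
collect_post_text <- function(msg_elems) {
  vapply(
    msg_elems,
    # getElementText() returns a list; its first entry holds the text
    function(el) el$getElementText()[[1]],
    character(1)
  )
}

# Usage, once elem.msgs exists in the session:
# all.msgs <- collect_post_text(elem.msgs)
# length(all.msgs)
```

Using vapply rather than sapply keeps the result a character vector even when no posts are found.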

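Update: I found that if I use remDr$switchToFrame("dsq-app2"), I don't need to go through the src URL as explained above. So there are actually two ways of scraping:

using switchToFrame("nameOfFrame"), or
using the iframe's src attribute, as I did earlier.

Hope this is clearer.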
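
For the switchToFrame route, a rough sketch might look like the following. It assumes remDr is an RSelenium remoteDriver that has already navigated to the article page, that the iframe id is "dsq-app2", and that comment bodies carry the class post-message, as in the code earlier in this answer. The function name scrape_via_frame is hypothetical.

```r
# Hypothetical wrapper around the switchToFrame approach.
# remDr: an open RSelenium remoteDriver already on the article page.
scrape_via_frame <- function(remDr, frame_id = "dsq-app2") {
  # Move the driver's focus into the Disqus iframe
  remDr$switchToFrame(frame_id)

  # Inside the frame, the comment bodies can be located directly
  msgs <- remDr$findElements(using = "class name", value = "post-message")

  # getElementText() returns a list; take its first entry for each post
  vapply(msgs, function(el) el$getElementText()[[1]], character(1))
}

# Usage:
# remDr$navigate("toTheRequiredPage")
# all.msgs <- scrape_via_frame(remDr)
```

Disqus loads comments asynchronously, so a short wait before calling the function may be needed if it returns an empty vector.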

RSelenium scraping for Disqus comments
