I'm trying to scrape or obtain the text of Disqus comments from an online local newspaper using RSelenium in Chrome but am finding the going a little tough for my capabilities. I have searched many places but did not find the right information or I am using the wrong search terms (most probably).
So far I have managed to get the "normal" HTML from the pages, but I cannot pinpoint the right class, CSS selector or id to get the Disqus comments. I have also tried SelectorGadget, but it only points to #dsq-app2, which selects the whole Disqus area at once and does not let me select smaller parts of it. I tried the same with RSelenium using elems <- mybrowser$findElement(using = "id", "dsq-app2"), which stores an "environment" in elems. Then I tried to find child elements within elems but came up blank.
Viewing the page via developer tools, I can see that the interesting stuff is within an iframe called #dsq-app2, and I have managed to extract all its source through elems$getPageSource() after switching to the frame using elems$switchToFrame("dsq-app2"). This outputs all the HTML as one big "dirty" chunk and, short of searching it for the required content held in <p> tags and other elements of interest, such as posters' usernames in data-role="username", I can't find the right way forward.
I have also tried using the advice given here but the Disqus setup is a little different. One of the pages I'm trying is this, with the bulk of the comments area within a section called conversation, a ton of other ids such as posts, and the unordered list with id=post-list that ultimately carries the comments I need to scrape.
Any ideas or help tips are most welcome and received with thanks.
Answer 0 (score: 1)
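After a lot of testing and experimentation, I made it work. I don't know whether it is the cleanest or prettiest solution, but it works. Hopefully others will find it useful. Basically, what I did was find the url that points to the comments only. It is found on the "dsq-app2" iframe as an attribute called src. At first I also switched to the iframe, but that turned up nothing.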
# remDr is assumed to be an already-open RSelenium remoteDriver session
remDr$navigate("toTheRequiredPage")
elemsource <- remDr$findElement(using = "id", value = "dsq-app2")
src <- elemsource$getElementAttribute("src") # get the src attribute of the iframe
remDr$navigate(src[[1]]) # navigate to the src url
# find the posts on the new page
elem <- remDr$findElement(using = "id", value = "posts")
elem.posts <- elem$findChildElements(using = "id", value = "post-list")
elem.msgs <- elem.posts[[1]]$findChildElements(using = "class name", value = "post-message")
length(elem.msgs) # number of comments found
msgtext <- elem.msgs[[1]]$getElementText() # text of the first post
msgtext # print message
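The code above prints only the first post. To gather the text of every comment, something like the following sketch might help. It assumes elem.msgs is the list of web elements found above; collect_text is a hypothetical helper name, not part of RSelenium.

```r
# Hypothetical helper (not part of RSelenium): collect the text of every
# element in a list of web elements into a plain character vector.
# Each element is assumed to provide getElementText(), which in RSelenium
# returns a list whose first entry is the element's visible text.
collect_text <- function(elements) {
  vapply(
    elements,
    function(el) el$getElementText()[[1]],
    character(1)
  )
}
```

With a live session this would be called as all.msgs <- collect_text(elem.msgs), giving one string per comment.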
Update: I found that if I use remDr$switchToFrame("dsq-app2"), I don't need to use the src url as explained above. So there are actually two ways of scraping:
1. switchToFrame("nameOfFrame")
2. or the src url solution
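The switchToFrame route could be sketched roughly as below. This is only an illustration under assumptions: scrape_via_frame is a hypothetical wrapper name, remDr is an open RSelenium remoteDriver, and the frame is passed as a web element rather than by name, since some driver versions may not accept a name string. The "dsq-app2" id and "post-message" class are taken from the code above.

```r
# Hypothetical wrapper (not part of RSelenium): read comment text by
# switching into the Disqus iframe instead of navigating to its src url.
# remDr is assumed to be an open RSelenium remoteDriver created elsewhere.
scrape_via_frame <- function(remDr, frame_id = "dsq-app2") {
  # Locate the iframe element on the outer page
  frame <- remDr$findElement(using = "id", value = frame_id)
  # Move the driver's focus inside the iframe
  remDr$switchToFrame(frame)
  # Always return focus to the top-level page when the function exits
  on.exit(remDr$switchToFrame(NULL), add = TRUE)
  # Inside the frame, find every comment body and extract its text
  msgs <- remDr$findElements(using = "class name", value = "post-message")
  vapply(msgs, function(el) el$getElementText()[[1]], character(1))
}
```

After navigating to the article page, scrape_via_frame(remDr) would return one string per comment. Comments that Disqus has not yet loaded will not be included.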
Hope this is clearer.