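First of all, apologies for the length of this post; I want to explain in detail what I'm trying to do.
I'm refining a scraping application I wrote in R to fetch Disqus comments. So far I can get all the comments on a given page using various RSelenium functions. What I want now is a tree structure of the posted comments: get the top-level comments first, then check whether each of them has any children. One particular page on the site I'm working with has 34 comments in total, but only 18 of them are top-level. The rest are children, or children of children.
I open the page in Chrome through a webdriver and use SelectorGadget to find the right selectors, as follows: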
1. elem <- remDr$findElement(using = "id", value = "posts")
2. elem.posts <- elem$findChildElements(using = "id", value = "post-list")
3. elem.posts <- elem$findElements(using = 'css selector', value = '.post~ .post+ .post')
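In the code above, line 1 finds the posts section. If I then use line 2, I get every post on the page, and the following line finds all the messages, so on a page with 34 comments I get all 34: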
elem.msgs <- elem.posts[[1]]$findChildElements(using = 'css selector', '.post-message')
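Now that I've realised the "tree" structure of the comments matters for my data project, I'm trying to get the top-level comments first and then look inside each one for any children. An example page is here. To get the top-level comments I use lines 1 and 3 above, which returns a list of 16. Calling elem.posts[[1]]$getElementAttribute("id") gives me the post ID, which I can use later to look up each top-level comment.
That list of 16 should be 18, and I can't work out why the first two comments are not captured. The same thing happens on other pages, where a number of top-level comments are missing from the list.
My question: what can I try so that I get every top-level comment on a page without any of them dropping out? And is there a better way to get the top-level comments than this roundabout approach, which I have little experience with?
Thanks for any help or pointers.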
Answer 0 (score: 1)
You can use a recursive function to pull out the posts. RSelenium is only needed to get the page source:
library(xml2)
library(RSelenium)
library(jsonlite)
# startServer() comes from older RSelenium releases; newer versions provide rsDriver() instead
selServ <- startServer()
appURL <- "http://disqus.com/embed/comments/?base=default&version=90aeb3a56d1f2d3db731af14996f11cf&f=malta-today&t_i=article_67726&t_u=http%3A%2F%2Fwww.maltatoday.com.mt%2Fnews%2Fnational%2F67726%2Fair_malta_pilots_demands_30_basic_salary_increase&t_d=Air%20Malta%20pilots%E2%80%99%20demands%3A%2030%25%20basic%20salary%20increase%2C%20increased%20duty%20payments%2C%20double%20%E2%80%98denied%20leave%E2%80%99%20payment&t_t=Air%20Malta%20pilots%E2%80%99%20demands%3A%2030%25%20basic%20salary%20increase%2C%20increased%20duty%20payments%2C%20double%20%E2%80%98denied%20leave%E2%80%99%20payment&s_o=default"
remDr <- remoteDriver()
remDr$open()
remDr$navigate(appURL)
pgSource <- remDr$getPageSource()[[1]]
remDr$close()
selServ$stop()
doc <- read_html(pgSource)
appNodes <- xml_find_all(doc, "//ul[@id='post-list']/li[@class='post']")
# recursive function: extract one post, then descend into its replies
content_fun <- function(x){
  # body of this post only (the leading ./ keeps nested replies out)
  main <- xml_find_all(x, "./div[@data-role]/.//div[@class='post-body']")
  main <- list(
    poster = xml_text(xml_find_all(main, ".//span[@class = 'post-byline']")),
    posted = xml_text(xml_find_all(main, ".//span[@class = 'post-meta']")),
    date = xml_attr(xml_find_all(main, ".//a[@class = 'time-ago']"), "title"),
    message = xml_text(xml_find_all(main, ".//div[@data-role = 'message']"))
  )
  # check for children: direct replies sit in a nested <ul class="children">
  children <- xml_find_all(x, "./ul[@class='children']/li[@class='post']")
  if(length(children) > 0){
    main$children <- lapply(children, content_fun)
  }
  main
}
# one list element per top-level post, with replies nested under $children
postData <- lapply(appNodes, content_fun)
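To see why selecting direct children matters, here is a minimal sketch on made-up markup (the HTML, the `post-N` ids and the variable names are assumptions for illustration, not the real Disqus page). It contrasts the direct-child XPath used above with a "every post at any depth" query, and with a query that only keeps items having at least two preceding siblings. The last one is roughly what the sibling selector `.post~ .post+ .post` from the question asks for, which would be consistent with getting 16 instead of 18.

```r
library(xml2)

# Hypothetical, simplified markup: three top-level posts, one nested reply.
html <- "
<ul id='post-list'>
  <li class='post' id='post-1'><div class='post-message'>first</div></li>
  <li class='post' id='post-2'><div class='post-message'>second</div>
    <ul class='children'>
      <li class='post' id='post-3'><div class='post-message'>reply</div></li>
    </ul>
  </li>
  <li class='post' id='post-4'><div class='post-message'>third</div></li>
</ul>"
doc <- read_html(html)

# Direct children of the list only: the top-level posts.
top_level <- xml_find_all(doc, "//ul[@id='post-list']/li[@class='post']")

# Every post at any depth: top-level posts and replies mixed together.
all_posts <- xml_find_all(doc, "//li[@class='post']")

# Top-level items with at least two earlier siblings; the first two
# items of the list can never satisfy this condition.
skipping <- xml_find_all(
  doc, "//ul[@id='post-list']/li[count(preceding-sibling::li) >= 2]"
)

xml_attr(top_level, "id")
xml_attr(skipping, "id")
```

Note that `@class='post'` is an exact string match, so it would miss items carrying extra classes; whether that matters depends on the actual page.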
For example, here is the third post:
> prettify(toJSON(postData[[3]]))
{
    "poster": [
        "\nMary Attard\n\n"
    ],
    "posted": [
        "\n•\n\n\na month ago\n\n"
    ],
    "date": [
        "Thursday, July 21, 2016 6:12 AM"
    ],
    "message": [
        "\nI give up. Air Malta should be closed down.\n"
    ],
    "children": [
        {
            "poster": [
                "\nJoseph Lawrence\n\n Mary Attard\n"
            ],
            "posted": [
                "\n•\n\n\na month ago\n\n"
            ],
            "date": [
                "Thursday, July 21, 2016 7:43 AM"
            ],
            "message": [
                "\nAir Malta should have been privatized or sold out right a long time ago. It is costing the TAX PAYER millions, it has for a long, long time.\n"
            ]
        },
        {
            "poster": [
                "\nJ.Borg\n\n Mary Attard\n"
            ],
            "posted": [
                "\n•\n\n\na month ago\n\n"
            ],
            "date": [
                "Thursday, July 21, 2016 5:23 PM"
            ],
            "message": [
                "\nYes - at this stage we taxpayers will be better off without Air Malta. We closed Malta Dry Docks and we survived. We can close Air Malta and we'll survive even better. After all, we have many more airlines serving us.\n"
            ]
        }
    ]
}
From there you can clean up the result and extract whichever fields you need.
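As one possible way to do that clean-up, the sketch below collapses the stray whitespace and flattens the nested list into a data frame with a `depth` column. The `post` list is a hand-written, shortened sample shaped like the output above, and `squish` and `flatten_post` are hypothetical helper names, not part of any package. It assumes every field holds exactly one string; posts with missing fields would need extra handling.

```r
# Hand-written sample shaped like one element of the scraped list (assumption).
post <- list(
  poster  = "\nMary Attard\n\n",
  date    = "Thursday, July 21, 2016 6:12 AM",
  message = "\nI give up. Air Malta should be closed down.\n",
  children = list(
    list(
      poster  = "\nJoseph Lawrence\n\n Mary Attard\n",
      date    = "Thursday, July 21, 2016 7:43 AM",
      message = "\nAir Malta should have been privatized.\n"
    )
  )
)

# Collapse runs of whitespace to one space and trim both ends.
squish <- function(x) trimws(gsub("\\s+", " ", x))

# Turn one post and all of its replies into rows, recording nesting depth.
flatten_post <- function(p, depth = 0L) {
  row <- data.frame(
    depth   = depth,
    poster  = squish(p$poster),
    date    = p$date,
    message = squish(p$message),
    stringsAsFactors = FALSE
  )
  # p$children is NULL for leaf posts, so lapply() returns an empty list.
  kids <- lapply(p$children, flatten_post, depth = depth + 1L)
  do.call(rbind, c(list(row), kids))
}

flat <- flatten_post(post)
flat
```

For the whole page, something like `do.call(rbind, lapply(postData, flatten_post))` should give one row per comment. The byline of a reply also contains the name of the person being replied to, so the `poster` column of replies will need further splitting if you want the two names separately.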