Data extraction of user reviews

Time: 2015-08-30 11:42:16

Tags: xml r web-crawler data-extraction

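I am trying to learn R out of personal interest; I am neither a coder nor an analyst. I want to extract user reviews from Trip Advisor. A single page has 10 reviews, but with the code below I also get unwanted comments/rows. I am not sure whether I am using the right html nodes. Also, I want to extract each user's complete review, but what I get ends with only part of the review. Can you help me extract the 10 complete user reviews? Thanks a lot for your help.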

  library(XML)
  dat <- readLines("http://www.tripadvisor.in/Hotel_Review-g60763-d93450-Reviews-Grand_Hyatt_New_York-New_York_City_New_York.html", warn=FALSE)
  raw2 <- htmlTreeParse(dat, useInternalNodes = TRUE)
  ##User Review
  plain.text <- xpathSApply(raw2, "//div[@class='col2of2']//p[@class='partial_entry']", xmlValue)
  UR <-gsub("\\\n","",plain.text)
  Result <- unlist(UR)
  Result

1 answer:

Answer 0 (score: 2)

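This is more of a web-scraping exercise than R programming.

In R, I prefer the httr package for fetching the http response and turning the content into parsed html. Using readLines(...) is just about the worst way to do it. The code below extracts the review summaries.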

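library(httr)
library(XML)
url <- "http://www.tripadvisor.in/Hotel_Review-g60763-d93450-Reviews-Grand_Hyatt_New_York-New_York_City_New_York.html"
# fetch the page
response <- GET(url)
# parse the body with the XML package, so that xpathSApply() works
# whatever class content() would return for type="text/html"
doc      <- htmlParse(content(response,as="text"),asText=TRUE)
# one <p class="partial_entry"> per review summary
smry     <- xpathSApply(doc,'//div[@class="entry"]/p[@class="partial_entry"]',xmlValue)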
length(smry)
# [1] 10
smry[1]
# [1] "\nThats all that matters really...I wonder if anyone would chose this hotel for any other factor at all...located right next to Grand central station in midtown and within walking distance of many tourist attractions, top restaurants and corp offices. Stayed 3 nights here on a business trip, I chose this hotel over others purely based on its location. Price is...\n\n\nMore  \n\n"

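Getting the full reviews is more complicated, because it involves clicking the "More" button. So you need to identify which http requests are triggered when you click the "More" link. You can do this with the Network Monitor tab in Firefox's developer tools (or with many other tools, I'm sure). It turns out to be a link of the form: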

http://www.tripadvisor.com/ExpandedUserReviews-g{xxx}-d{yyy}?querystring

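where {xxx} and {yyy} are unique to the hotel and are the same as in the original URL, and the querystring is fully identified in the network monitor tool. So we form a new http request using that URL and the appropriate query string, and parse the result, as follows.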

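# class attributes of the "More" links; each contains a review id of the form tr<digits>
cls         <- doc['//div[@class="entry"]//span[contains(@class,"moreLink")]/@class']
# extract the digits that follow "tr"
xr.refno    <- sapply(cls,function(x)sub(".*\\str(\\d+)\\s.*","\\1",x))
# extract the "-g{xxx}-d{yyy}" hotel code from the original url
code        <- sub(".*Hotel_Review(\\-g\\d+\\-d\\d+)\\-Reviews.*","\\1",url)
xr.url      <- paste0("http://www.tripadvisor.com/ExpandedUserReviews",code)
# query string as observed in the network monitor
xr.response <- GET(xr.url,query=list(target=xr.refno[1],
                                     context=1,
                                     reviews=paste(xr.refno,collapse=","),
                                     servlet="Hotel_Review",
                                     expand=1))
# parse with the XML package, as for the first request
xr.doc      <- htmlParse(content(xr.response,as="text"),asText=TRUE)
xr.full     <- xpathSApply(xr.doc,'//div[@class="entry"]/p',xmlValue)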
length(xr.full)
# [1] 6
xr.full[1]
# [1] "\nThats all that matters really...I wonder if anyone would chose this hotel for any other factor at all...located right next to Grand central station in midtown and within walking distance of many tourist attractions, top restaurants and corp offices. Stayed 3 nights here on a business trip, I chose this hotel over others purely based on its location. Price is about average in NYC I think. Asked for a room with a good view and was given a 2 BR on the 30th floor. After checking in I realized there may not be the kind of view that I expected at all from any room in this hotel - due to it being surrounded by high rises in all directions. However, no other complaints as such - except may that the bathroom was a bit too cramped. That I guess is the norm in NYC. I would stay here again if it was a business visit based on the location. Faster than avg wifi (free) was a good plus.\n"
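
The two regular expressions used to build the request can be checked offline, without touching the site. The sketch below applies them to a made-up class attribute and to the hotel URL from the question. The class string and the id 305000000 are hypothetical, chosen only to fit the pattern the regex expects (whitespace, "tr", digits, whitespace); the real markup may look different.

```r
# Hypothetical class attribute of a "More" link (illustration only).
cls <- "taLnk hvrIE6 tr305000000 moreLink ulBlueLinks"

# "\\s" matches the whitespace before "tr"; the captured digits are the review id.
refno <- sub(".*\\str(\\d+)\\s.*", "\\1", cls)

url <- "http://www.tripadvisor.in/Hotel_Review-g60763-d93450-Reviews-Grand_Hyatt_New_York-New_York_City_New_York.html"

# Capture the "-g{xxx}-d{yyy}" part that identifies the hotel.
code <- sub(".*Hotel_Review(\\-g\\d+\\-d\\d+)\\-Reviews.*", "\\1", url)

# Base URL of the expanded-reviews request.
xr.url <- paste0("http://www.tripadvisor.com/ExpandedUserReviews", code)

print(refno)
print(xr.url)
```

If a class attribute does not contain the "tr" + digits token, sub() returns the input unchanged, so it is worth checking that every element of xr.refno is purely numeric before sending the request.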

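There is one more nuance/problem. Note that there are only 6 "expanded reviews". This is because short reviews fit in the "partial review" format and have no "More" button. So you need to figure out which partial reviews are actually complete. Since you say you are learning R, I'll leave that to you...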
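
One possible way to approach that last step is sketched here; it is not part of the original answer. It rests on two assumptions: a truncated summary ends with "..." followed by "More" (as in the smry[1] output above), and the expanded reviews come back in the same order as the truncated summaries. merge_reviews is a hypothetical helper name, and the sample data is made up.

```r
# Sketch: replace truncated summaries with their expanded text.
# Assumptions (not verified against the site):
#   - a truncated summary ends with "..." followed by "More"
#   - expanded reviews are in the same order as the truncated summaries
merge_reviews <- function(partial, expanded) {
  truncated <- grepl("\\.\\.\\.\\s*More\\s*$", partial)
  if (sum(truncated) != length(expanded)) {
    stop("number of truncated summaries does not match number of expanded reviews")
  }
  result <- partial
  result[truncated] <- expanded
  # Collapse runs of whitespace (including newlines) and trim both ends.
  trimws(gsub("\\s+", " ", result))
}

# Made-up data for illustration only.
partial  <- c("\nGreat location. Price is...\n\n\nMore  \n\n",
              "\nShort review, nothing to expand.\n")
expanded <- c("\nGreat location. Price is about average.\n")

merged <- merge_reviews(partial, expanded)
print(merged)
```

With the objects from the answer, the call would be merge_reviews(smry, xr.full). The length check stops early if the two assumptions do not hold for a given page.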