How to retrieve titles from query results using rvest

Time: 2016-11-04 01:00:37

Tags: html r rvest

I am using rvest to retrieve the titles from Google query results. My code looks like this:

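library(rvest)

# Build the search URL and download the results page
url <- URLencode(paste0("https://www.google.com.au/search?q=", "600d"))
page <- read_html(url)

# Extract the text of every link (<a> element) on the page
page %>%
  html_nodes("a") %>%
  html_text()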

However, the result contains not only the titles but also other items, for example:

 [24] "Past month"                                                                        
 [25] "Past year"                                                                         
 [26] "Verbatim"                                                                              
 [27] "EOS 600D - Canon"                                                                  
 [28] "Similar"                                                                           
 [29] "Canon 600D | BIG W"                                                                
 [30] "Cached"                                                                                
 [31] "Similar"   
 ......
 [45] ""                                                                                          
 [46] ""                    

What I need are [27] "EOS 600D - Canon" and [29] "Canon 600D | BIG W", which are the titles shown for each hit on the Google results page.

Everything else is just noise to me. Can anyone tell me how to get rid of it?

Also, if I want the description part as well, what should I do?

1 Answer:

Answer 0: (score: 2)

To get the titles, don't select <a> (which matches every link); select <h3> instead:

page %>% 
  html_nodes("h3") %>%
  html_text()

 [1] "EOS 600D - Canon"                                                   
 [2] "Canon EOS 600D - Wikipedia"                                         
 [3] "Canon 600D | BIG W"                                                 
 [4] "Canon EOS 600D Digital SLR Camera with 18-55mm IS Lens Kit ..."     
 [5] "Canon Rebel T3i / EOS 600D Review: Digital Photography Review"      
 [6] "Canon EOS 600D review - CNET"                                       
 [7] "canon eos 600d | Cameras | Gumtree Australia Free Local Classifieds"
 [8] "Images for 600d"                                                    
 [9] "Canon 600D - Snapsort"                                              
[10] "Canon EOS 600D - Georges Cameras"
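
The difference between the two selectors can be reproduced offline with a small sketch. The HTML below is a hypothetical, heavily simplified stand-in for a results page, not Google's actual markup (which changes over time). In particular, the `desc` class used for the description text is an assumption made for illustration only; on a real page you would need to inspect the source to find the element that holds the description.

```r
library(rvest)

# Hypothetical, simplified stand-in for a search results page.
# The structure and the "desc" class are assumptions for illustration only.
html <- '
<html><body>
  <a href="/tools">Past month</a>
  <div class="result">
    <h3><a href="https://example.com/eos-600d">EOS 600D - Canon</a></h3>
    <a href="/cache">Cached</a>
    <a href="/similar">Similar</a>
    <span class="desc">Example description for the first result.</span>
  </div>
  <div class="result">
    <h3><a href="https://example.com/bigw">Canon 600D | BIG W</a></h3>
    <a href="/similar">Similar</a>
    <span class="desc">Example description for the second result.</span>
  </div>
</body></html>'

page <- read_html(html)

# Every <a>: titles plus the "Past month", "Cached" and "Similar" noise
all_links <- page %>% html_nodes("a") %>% html_text()

# Only <h3>: just the result titles
titles <- page %>% html_nodes("h3") %>% html_text()

# Links nested inside <h3>: the target URL of each title
urls <- page %>% html_nodes("h3 a") %>% html_attr("href")

# Description text, selected by the assumed (hypothetical) class name
descriptions <- page %>% html_nodes("span.desc") %>% html_text()

print(titles)
print(descriptions)
```

The same pattern should carry over to a live page once the right selector for the description element has been identified, for example with the browser's developer tools.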