Which selector should I use in rvest to extract information from a Google web search?

Date: 2015-03-14 18:41:47

Tags: r selector rvest

I am trying to scrape the content of a Google web search, specifically the <h3 class="r"> result headings shown in the image below.

I tried to write such a selector in R with the rvest package, but it returns no results. Does anyone know what the selector should look like?

> library(rvest)
> 
> googleContent <- html("https://www.google.pl/#q=wiadomosci") %>% 
+    html_nodes( "h3[class=r]" )
> googleContent
list()
attr(,"class")
[1] "XMLNodeSet"
> googleContent <- html("https://www.google.pl/#q=wiadomosci") %>% 
+    html_nodes( "h3.r" )
> googleContent
list()
attr(,"class")
[1] "XMLNodeSet"

I also tried other packages, but I don't like how messy the code is... (I modified the code from this article.)

> # load packages
> library(RCurl)
> library(XML)
> library(dplyr)
> get_google_page_urls <- function(u) {
+    # read in page contents
+    html <- getURL(u)
+    
+    # parse HTML into tree structure
+    doc <- htmlParse(html)
+    
+    # extract url nodes using XPath. Originally I had used "//a[@href][@class='l']" until the google code change.
+    links <- xpathApply(doc, "//h3//a[@href]", function(x) xmlAttrs(x)[[1]])
+    
+    # free doc from memory
+    free(doc)
+    
+    # ensure urls start with "http" to avoid google references to the search page
+    links <- grep("http://", links, fixed = TRUE, value=TRUE)
+    return(links)
+ }
> 
> u <- "http://www.google.pl/search?aq=f&gcx=w&sourceid=chrome&ie=UTF-8&q=wiadomosci"
>  get_google_page_urls(u) %>% grep( pattern = "/url", value = TRUE) %>% strsplit( "?q=") %>%
+    lapply( function(element){ strsplit( element[2], ".pl" )[[1]][1] } ) %>%
+    unlist() %>% paste0(".pl") %>% unique()
[1] "http://wiadomosci.onet.pl"   "http://www.tvn24.pl"         "http://tvnwarszawa.tvn24.pl"
[4] "http://wiadomosci.wp.pl"     "http://warszawa.gazeta.pl"   "http://wiadomosci.gazeta.pl"
[7] "http://wiadomosci.tvp.pl"    "http://www.se.pl"   

Could this be of any help? I don't understand these functions, because the documentation is sparse:

search <- html_form(html("https://www.google.com"))[[1]]


set_values(search, q = "My little pony")

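From what I can piece together (again only my guess, based on the rvest examples rather than proper documentation): html_form() extracts the search form from the page, set_values() fills in the q field, and submit_form() sends it through a session, after which the results page could be queried with the same selectors. A rough sketch with the current rvest API, assuming Google returns usable result markup:

library(rvest)

# Sketch only: whether Google serves parseable results to rvest is an assumption.
session <- html_session("https://www.google.pl")        # start a browsing session
search  <- html_form(session)[[1]]                      # pick up the search form
search  <- set_values(search, q = "wiadomosci")         # fill in the query field
results <- submit_form(session, search)                 # submit and follow to the results page

results %>% html_nodes("h3.r a") %>% html_text()        # assumed selector, as above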

0 Answers:

No answers yet.