I am trying to scrape the content of a Google web search, specifically the nodes
<h3 class="r">
as shown in the screenshot below.
I tried to write such a selector in R using the rvest
package, but got no results.
Does anyone know what the selector should look like?
> library(rvest)
>
> googleContent <- html("https://www.google.pl/#q=wiadomosci") %>%
+ html_nodes( "h3[class=r]" )
> googleContent
list()
attr(,"class")
[1] "XMLNodeSet"
> googleContent <- html("https://www.google.pl/#q=wiadomosci") %>%
+ html_nodes( "h3.r" )
> googleContent
list()
attr(,"class")
[1] "XMLNodeSet"
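A likely reason both selectors return an empty set is the URL itself rather than the selector: everything after `#` is a URI fragment, which the browser handles client-side and never sends to the server, so `html()` only ever sees the Google homepage, which contains no `h3.r` nodes. A minimal sketch against the `/search?q=` endpoint instead (assuming Google serves a static result page to non-JavaScript clients, which it may not always do):

```r
library(rvest)

# "/search?q=..." is an actual server-side query, unlike "#q=..."
url  <- "https://www.google.pl/search?q=wiadomosci"
page <- html(url)          # read_html() in newer rvest versions

# with a real result page, both "h3.r" and "h3[class=r]" should match
titles <- page %>%
  html_nodes("h3.r") %>%
  html_text()
```

Whether this returns anything still depends on the markup Google serves to your client, so inspect the fetched page before blaming the selector.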
I have tried other packages, but I don't like the messy code... (adapted from the code in this article)
> # load packages
> library(RCurl)
> library(XML)
> library(dplyr)
> get_google_page_urls <- function(u) {
+ # read in page contents
+ html <- getURL(u)
+
+ # parse HTML into tree structure
+ doc <- htmlParse(html)
+
+ # extract url nodes using XPath. Originally I had used "//a[@href][@class='l']" until the google code change.
+ links <- xpathApply(doc, "//h3//a[@href]", function(x) xmlAttrs(x)[[1]])
+
+ # free doc from memory
+ free(doc)
+
+ # ensure urls start with "http" to avoid google references to the search page
+ links <- grep("http://", links, fixed = TRUE, value=TRUE)
+ return(links)
+ }
>
> u <- "http://www.google.pl/search?aq=f&gcx=w&sourceid=chrome&ie=UTF-8&q=wiadomosci"
> get_google_page_urls(u) %>% grep( pattern = "/url", value = TRUE) %>% strsplit( "?q=") %>%
+ lapply( function(element){ strsplit( element[2], ".pl" )[[1]][1] } ) %>%
+ unlist() %>% paste0(".pl") %>% unique()
[1] "http://wiadomosci.onet.pl" "http://www.tvn24.pl" "http://tvnwarszawa.tvn24.pl"
[4] "http://wiadomosci.wp.pl" "http://warszawa.gazeta.pl" "http://wiadomosci.gazeta.pl"
[7] "http://wiadomosci.tvp.pl" "http://www.se.pl"
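The nested `strsplit()` pipeline above can be collapsed into a small helper. This is only a sketch of an alternative, not the original author's method: it pulls the value of the `q=` parameter out of Google's `/url?q=...` redirect links with one regex, then keeps just the scheme and host (the `clean_google_links` name is mine):

```r
# extract the redirect target from "/url?q=..." links, keep scheme + host
clean_google_links <- function(links) {
  targets <- sub(".*[?&]q=([^&]+).*", "\\1", links)    # value of the q= param
  hosts   <- sub("^(https?://[^/]+).*", "\\1", targets) # drop path and query
  unique(hosts)
}

clean_google_links("/url?q=http://wiadomosci.onet.pl/abc&sa=U")
# "http://wiadomosci.onet.pl"
```

This avoids hard-coding the `.pl` suffix, so it also survives results from non-Polish domains.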
Could this be of any help? I don't understand this function, because the documentation is poor:
search <- html_form(html("https://www.google.com"))[[1]]
set_values(search, q = "My little pony")
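For reference, these two calls are meant to be combined with a session and a submit step. A sketch of how they fit together, assuming the rvest API of that era (`html_session()`/`submit_form()`; newer rvest renamed these to `session()`/`session_submit()`):

```r
library(rvest)

sess   <- html_session("https://www.google.com")
search <- html_form(sess)[[1]]                 # the search <form> on the page
filled <- set_values(search, q = "wiadomosci") # fill in the "q" text input
result <- submit_form(sess, filled)            # performs the GET request

# the response is a page, so the same node extraction applies
result %>% html_nodes("h3.r") %>% html_text()
```

`set_values()` returns a modified copy of the form; it does nothing on its own until the form is submitted within a session.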