Extracting a table from search results with rvest and CSS selectors

Date: 2015-06-15 16:43:32

Tags: html css r rvest

I just learned about rvest in Hadley's excellent webinar and am trying it out for the first time.

I want to scrape (and then plot) the baseball standings table returned by a Google search.

My problem is that I can't get to the table I can see in my browser plug-in.

library(rvest)
library(magrittr) # for the %>% operator
( g_search <- html_session(url = "http://www.google.com/?q=mlb+standings",
                           add_headers("user-agent" = "Mozilla/5.0")) )
# <session> http://www.google.com/?q=mlb+standings
#   Status: 200
#   Type:   text/html; charset=UTF-8
#   Size:   52500

This search should return a page in which the table sits under several layers but is uniquely identified by <div class="tb_strip">. A quick stop at CSS Diner taught me that (I think) "div.tb_strip" is a valid CSS selector for capturing this table (and possibly some other junk). In fact, using Firebug's CSS selector, I see the full path:

# Use Firebug "Copy CSS Path" and paste into table_path
table_path <- "html body#gsr.srp.tbo.vasq div#main div#cnt.big div.mw div#rcnt div.col div#center_col div#res.med div#search div div#ires ol#rso li.g.tpo.knavi.obcontainer div.kp-blk div#uid_0.r-iCGI_bFBahQE.xpdbox.xpdopen div div.lr_container.mod div#lr_tab_unit_uid_1.tb_u.r-igQv_rxlT08k div.tb_view div.tb_strip"

However, the following attempt to access the table fails because html_nodes returns an empty list.

The content doesn't seem to make it into g_search, so I don't yet know whether the CSS selector works.

( standings <- g_search %>% 
    html_nodes("div.tb_strip") %>% 
    html_table() 
  ) #returns empty list
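
One way to narrow this down is to separate two questions: is the selector valid, and does the fetched HTML contain the element at all? The sketch below checks both offline on two small stand-in documents. Both HTML strings and the helper name check_selector are hypothetical, and the second document only simulates the assumption (not confirmed here) that the table is missing from the markup the server returns.

```r
library(rvest)

# Hypothetical stand-in for a page whose source already contains the table.
with_table <- read_html('<html><body>
  <div class="tb_strip"><table>
    <tr><th>Team</th><th>W</th></tr>
    <tr><td>Team A</td><td>10</td></tr>
  </table></div>
</body></html>')

# Hypothetical stand-in for a page whose source lacks the table entirely.
without_table <- read_html('<html><body><div id="main"></div></body></html>')

# Hypothetical helper: report whether a marker string occurs in the
# serialized document and how many nodes the CSS selector matches.
check_selector <- function(page, css, marker) {
  list(
    in_source = grepl(marker, as.character(page), fixed = TRUE),
    n_matches = length(html_nodes(page, css))
  )
}

found  <- check_selector(with_table, "div.tb_strip", "tb_strip")
absent <- check_selector(without_table, "div.tb_strip", "tb_strip")
```

If the marker string is not in the source at all, the selector has nothing to match, whether or not it is valid.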

Where did it go?

TYVM

1 Answer:

Answer 0 (score: 2)

Here is an example using an easier site...

library("rvest")
url <- "http://sports.yahoo.com/mlb/standings/"
# html() is the parser in the rvest release of the time; later releases call it read_html()
html(url) %>%
  html_nodes(".yui3-tabview-content") %>%
  html_nodes("table") %>%
  html_table()
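
The chained html_nodes() calls first narrow the document to a container and then to the tables inside it. Below is a minimal offline sketch of that pattern. It assumes a made-up HTML fragment (the team names and numbers are placeholders, not real standings) and uses read_html() from current rvest releases.

```r
library(rvest)

# Made-up fragment: one table inside the container, one outside it.
page <- read_html('<html><body>
  <div class="yui3-tabview-content">
    <table>
      <tr><th>Team</th><th>W</th><th>L</th></tr>
      <tr><td>Team A</td><td>10</td><td>5</td></tr>
      <tr><td>Team B</td><td>8</td><td>7</td></tr>
    </table>
  </div>
  <table><tr><th>Other</th></tr><tr><td>ignored</td></tr></table>
</body></html>')

# Select the container first, then only the tables nested in it.
tables <- page %>%
  html_nodes(".yui3-tabview-content") %>%
  html_nodes("table") %>%
  html_table()

# html_table() on a node set returns a list, one element per table.
standings <- tables[[1]]
```

Because the result is a list with one data frame per matched table, it is indexed with [[1]] before plotting.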