How to scrape multiple tables that have no ID or class using R

Time: 2017-08-18 11:53:05

Tags: r web-scraping rvest

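I am trying to scrape this web page (all of its pages) using R: http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all

I am new to programming. Everywhere I have looked, tables are mostly identified by an ID, a div, or a class. This page has none of those. The data is stored in table format. How can I scrape it?

Here is what I have done so far: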

    library(rvest)

    # read the first results page (the URL must stay on one line, including the "?")
    webpage <- read_html("http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all")

    # select every <table> node on the page
    tbls <- html_nodes(webpage, "table")

    head(tbls)

    # keep tables 9 and 10 and parse them into data frames
    tbls_ls <- webpage %>%
      html_nodes("table") %>%
      .[9:10] %>%
      html_table(fill = TRUE)

    colnames(tbls_ls[[1]]) <- c("Mobile Make", "State", "District",
                                "Police Station", "Status", "Mobile Type(GSM/CDMA)",
                                "FIR/DD/GD Date")

1 Answer:

Answer 0 (score: 0)

You can scrape the table data by targeting the css id of each table. It looks like each page is composed of 3 different tables pasted one after another. Two of the tables have the #AutoNumber16 css id, while the third (middle) one has the #AutoNumber15 css id.

I put together a simple code example that should get you started in the right direction.

    suppressMessages(library(tidyverse))
    suppressMessages(library(rvest))

    # define function to scrape the table data from a page
    get_page <- function(page_id = 1) {
      # default link
      link <- "http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all&Page_No="
      # build link
      link <- paste0(link, page_id)
      # get tables data
      wp <- read_html(link)
      wp %>%
        html_nodes("#AutoNumber16, #AutoNumber15") %>%
        html_table(fill = TRUE) %>%
        bind_rows()
    }

    # get the data from the first three pages
    iter_page <- 1:3
    # this is just a progress bar
    pb <- progress_estimated(length(iter_page))

    # this code will iterate over pages 1 through 3 and apply the get_page()
    # function defined earlier. The Sys.sleep() part is used to pause the code
    # after each iteration so that the server is not overloaded with requests.
    map_df(iter_page, ~ {
      pb$tick()$print()
      df <- get_page(.x)
      Sys.sleep(sample(10, 1) * 0.1)
      as_tibble(df)
    })
    #> # A tibble: 72 x 4
    #>    X1                    X2           X3
    #>    <chr>                 <chr>        <chr>
    #>  1 FIR/DD/GD Number      000165       State
    #>  2 FIR/DD/GD Date        17/08/2017   District
    #>  3 Mobile Type(GSM/CDMA) GSM          Police Station
    #>  4 Mobile Make           SAMSUNG J2   Mobile Number
    #>  5 Missing/Stolen Date   23/04/2017   IMEI Number
    #>  6 Complainant           AKEEL KHAN   Complainant Contact Number
    #>  7 Status                Stolen/Theft Report Date/Time on ZIPNET
    #>  8 <NA>                  <NA>         <NA>
    #>  9 FIR/DD/GD Number      FIR No 37/   State
    #> 10 FIR/DD/GD Date        17/08/2017   District
    #> # ... with 62 more rows, and 1 more variables: X4 <chr>
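If you want to check how the id-based selector behaves without hitting the site, here is a minimal offline sketch. The HTML string is a made-up stand-in for one results page (it is an assumption, not the real markup of zipnet.in): it only mimics the layout described above, with two tables sharing the AutoNumber16 id, one AutoNumber15 table between them, and a hypothetical unrelated "layout" table.

```r
library(rvest)

# Hypothetical stand-in for a results page; the real page markup may differ.
html <- '
<html><body>
  <table id="layout"><tr><td>menu</td><td>links</td></tr></table>
  <table id="AutoNumber16"><tr><td>Status</td><td>Stolen/Theft</td></tr></table>
  <table id="AutoNumber15"><tr><td>Mobile Make</td><td>SAMSUNG J2</td></tr></table>
  <table id="AutoNumber16"><tr><td>Complainant</td><td>AKEEL KHAN</td></tr></table>
</body></html>'

page <- read_html(html)

# A comma in a CSS selector means "match either", so this picks up every
# table whose id is AutoNumber16 or AutoNumber15 and skips the layout table.
tables <- page %>%
  html_nodes("#AutoNumber16, #AutoNumber15") %>%
  html_table()

# number of tables that matched the selector
length(tables)
```

Only the three data tables should be matched here; the "layout" table is ignored because its id is not listed in the selector. The same idea is what get_page() relies on for the real pages.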