Question

I'm new to web scraping, and I'm trying to extract company information from this website: http://apps.asiainsurancereview.com/IDA/Asp/CompanyList.aspx?company=&type=&jobType=&country=&search=company

The information I want to extract is on pages like the following one (this is the first company listed in the table at the link above):

http://apps.asiainsurancereview.com/IDA/Asp/IDA_CompanyDetails.aspx?person=&designation=&company=&country=&search=company&comslno=272

I'm trying to extract the details (phone number, email, website, etc.) of every company listed at the first link and then export them to a .csv file.

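The problem, however, is that the numbers in the site's URLs are not sequential. For example, the details URL for the first company is the one above, ending in "comslno=272", while the URL for the second company ends in "comslno=1824".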

I tried the following R code (I know it's probably not feasible):

library(rvest)
library(dplyr)

directory <- lapply(paste0('http://apps.asiainsurancereview.com/IDA/Asp/IDA_CompanyDetails.aspx?person=&designation=&company=&country=&search=company&comslno=', 1:9999999),
                    function(url){
                      url %>% read_html() %>% 
                        html_nodes("tr td") %>% 
                        html_text()
                    })

write.csv(directory, file = "directory.csv")

However, it doesn't work, because a URL does not exist for every number between 1 and 9999999.

For example, the URLs ending in "comslno=1" and "comslno=2" exist, but the one ending in "comslno=3" does not.

Is there a way to make R ignore the URLs that don't exist and continue the process? Or is there some other, simpler way to do this?

Answer 1

You can use tryCatch, so that a failed request returns NULL instead of stopping the loop.

library(rvest)

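# Build one details URL per id; c(2:5, 1) is a small test set that includes
# ids that do not exist
directory <- lapply(paste0('http://apps.asiainsurancereview.com/IDA/Asp/IDA_CompanyDetails.aspx?person=&designation=&company=&country=&search=company&comslno=', 
                           c(2:5, 1)),
                    function(url) {
                      tryCatch(
                        url %>% read_html() %>%
                          html_nodes("tr td") %>% 
                          html_text(),
                        # Any error (e.g. a missing page) yields NULL for that URL
                        error = function(e) NULL
                        )
                    })

# rbind() skips the NULL entries, so only pages that loaded end up in the file
write.csv(do.call(rbind, directory), file = "directory.csv")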
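
To see how the tryCatch pattern behaves without sending any requests, here is a minimal sketch in base R. The function fetch_details(), the set of "existing" ids and the returned fields are all made up for illustration; in the real script the call to read_html() plays this role.

```r
# Hypothetical stand-in for the real request: errors for ids that "do not exist".
fetch_details <- function(id) {
  existing <- c(1, 2, 272, 1824)           # made-up set of valid ids
  if (!id %in% existing) stop("HTTP error 404 for id ", id)
  c(id = id, phone = paste0("555-", id))   # made-up fields
}

ids <- 1:5

# tryCatch turns each failure into NULL, so lapply keeps going.
results <- lapply(ids, function(id) {
  tryCatch(fetch_details(id), error = function(e) NULL)
})

# Drop the NULL entries (failed ids) before binding the rows together.
ok <- Filter(Negate(is.null), results)
directory <- do.call(rbind, ok)

print(directory)
```

Only the ids that succeeded contribute a row; the failed ones are dropped rather than aborting the whole run.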

Answer 2

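If you only want the first page (or you need a loop to combine the other pages), you can use CSS selectors to target the appropriate nodes in the DOM and extract the relevant attribute from the matched nodes. This gives you the numbers you need to append to the base URL. I show two different approaches; the uncommented one should be faster.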

library(rvest)
library(magrittr)

base = 'http://apps.asiainsurancereview.com/IDA/Asp/IDA_CompanyDetails.aspx?person=&designation=&company=&country=&search=company&comslno='
p <- read_html('http://apps.asiainsurancereview.com/IDA/Asp/CompanyList.aspx?company=&type=&jobType=&country=&search=company')
#urls <- paste0(base, p %>% html_nodes('#tableList tr[id]') %>% html_attr('id'))
urls <- paste0(base, p %>% html_nodes('.select') %>% html_attr('value'))
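
As a rough illustration of what the two selectors pick up, the sketch below runs them against a tiny inline HTML fragment instead of the live page. The markup is invented for the example (the real page's structure is an assumption here); only the selectors and the base URL come from the code above.

```r
library(rvest)

# Made-up markup: two table rows, each carrying its id twice,
# once as the row's id attribute and once as an input's value.
html <- '<table id="tableList">
  <tr id="272"><td><input class="select" value="272"></td><td>Company A</td></tr>
  <tr id="1824"><td><input class="select" value="1824"></td><td>Company B</td></tr>
</table>'
p <- read_html(html)

# Approach 1: read the id attribute of every row that has one.
ids_from_rows <- p %>% html_nodes('#tableList tr[id]') %>% html_attr('id')

# Approach 2: read the value attribute of every element with class "select".
ids_from_input <- p %>% html_nodes('.select') %>% html_attr('value')

base <- 'http://apps.asiainsurancereview.com/IDA/Asp/IDA_CompanyDetails.aspx?person=&designation=&company=&country=&search=company&comslno='
urls <- paste0(base, ids_from_input)

print(urls)
```

Both approaches return the same character vector of ids on this fragment; each id is then appended to the base URL to form one details URL per company.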

You can calculate the number of pages to loop over:

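library(stringr)  # for str_split()

# Text of the last pagination item; its final token is taken as the total result count
t <- p %>% html_node('#MainContent_pagination li:last-child') %>% html_text() %>% trimws()
total_results <- as.numeric(tail(str_split(t, ' ')[[1]], 1))
results_per_page <- 15
num_pages <- ceiling(total_results / results_per_page)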
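
The page-count arithmetic can be checked on its own with a made-up pagination string. The wording "Showing 1 - 15 of 3050" is an assumption for illustration, not the site's actual text; the sketch uses base R's strsplit() in place of str_split().

```r
# Made-up pagination text; the real wording on the site may differ.
pagination_text <- "  Showing 1 - 15 of 3050  "

# Take the last space-separated token as the total number of results.
tokens <- strsplit(trimws(pagination_text), " ")[[1]]
total_results <- as.numeric(tail(tokens, 1))

# Round up so a partially filled last page is still counted.
results_per_page <- 15
num_pages <- ceiling(total_results / results_per_page)

print(num_pages)
```

With 3050 results and 15 per page, 203 full pages cover 3045 results, so the remaining 5 need a 204th page.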

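With this strategy (loop from 1 to num_pages to collect the actual ids, then loop over the resulting set of URLs), you make far fewer requests overall than by trying every number from 1 to 9,999,999.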
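
The two-step strategy might be sketched as follows. The function scrape_directory() and its arguments get_page_ids and get_details are hypothetical names; they are passed in as functions because how the site addresses its list pages is not shown in this thread. The demonstration at the bottom uses made-up data rather than real requests.

```r
# Collect ids page by page, then fetch one detail record per id.
scrape_directory <- function(num_pages, get_page_ids, get_details) {
  # Step 1: one request per list page to gather the real ids.
  ids <- unlist(lapply(seq_len(num_pages), function(i) {
    tryCatch(get_page_ids(i), error = function(e) NULL)
  }))
  ids <- unique(ids)

  # Step 2: one request per id; failures become NULL and are dropped.
  details <- lapply(ids, function(id) {
    tryCatch(get_details(id), error = function(e) NULL)
  })
  do.call(rbind, Filter(Negate(is.null), details))
}

# Offline demonstration with made-up data instead of real requests.
fake_pages <- list(c("272", "1824"), c("5", "17"))
fake_ids <- function(i) fake_pages[[i]]
fake_details <- function(id) {
  if (id == "17") stop("simulated failure")
  data.frame(comslno = id, name = paste("Company", id), stringsAsFactors = FALSE)
}

result <- scrape_directory(2, fake_ids, fake_details)
print(result)
```

Against the real site, get_page_ids would wrap the read_html() and html_nodes('.select') calls for one list page, and get_details would wrap the "tr td" extraction for one details URL. The total number of requests is then num_pages plus the number of ids found.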
