Question

I am trying to scrape contact details from several different websites, so each site needs a different CSS selector.

Data:

                                website status                   email            phone              Fax                          cssr
1 http://www.saudiacatering.com/en/home     NA info@noorinvestment.com +966 12-686-0011 +966 12-686-1864                    .w-icon li
2  http://www.laithllc.com/contact.html     NA       info@laithllc.com  +971 2-553-7571  +971 2-353-7579 p+ p , section:nth-child(1) p

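I have read every post I could find on scraping multiple pages, but they all deal with sites that share similar URLs and similar CSS selectors/XPaths.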

Here is what I tried:

library(rvest)
        i<- str_replace_all(file$website, "http://www.[.]+", "")
    urls<- "http://www."
    cssr<- as.vector(file$cssr)
    for (i in urls){
      a01 <- paste0("http://www.",i, sep="")
      text <- read_html(a01) %>%
        html_nodes(cssr) %>% 
        html_text()

The idea was to treat http://www. as the base URL and append the rest of each site's link to it, but this did not work.

Has anyone done something similar, and am I using the right package?

New code:

library(stringr)
library(rvest)
library(magrittr)
    i<- str_replace_all(url, "http://www.", "")
urls<- "http://www."
cssr<- as.vector(file$cssr)
for (x in i){
  a01 <- paste0("http://www.",x, sep="")
  read_html(a01)%>%
for(m in cssr){html_nodes(m) %>%html_text()}}

    Error in for (. in m) file$cssr : 
  4 arguments passed to 'for' which requires 3

Answer 1

Taking @Spacedman's comment into account, perhaps this is what you want:

file <- read.table(header = TRUE, stringsAsFactors = FALSE, text =
'website  status  email  phone   Fax  cssr
http://www.saudiacatering.com/en/home NA info@noorinvestment.com "+966 12-686-0011" "+966 12-686-1864" ".w-icon li"
http://www.laithllc.com/contact.html NA  info@laithllc.com  "+971 2-553-7571" "+971 2-353-7579" "p+ p , section:nth-child(1) p"')

library(dplyr)
library(purrr)
library(rvest)
mutate(file, text = map2(website, cssr, ~ read_html(.x) %>% html_nodes(.y) %>% html_text()))
#                                 website status                   email            phone              Fax                          cssr                                                               text
# 1 http://www.saudiacatering.com/en/home     NA info@noorinvestment.com +966 12-686-0011 +966 12-686-1864                    .w-icon li +966 (12) 686-0011, +966 (12) 686-1864, careers@saudiacatering.com
# 2  http://www.laithllc.com/contact.html     NA       info@laithllc.com  +971 2-553-7571  +971 2-353-7579 p+ p , section:nth-child(1) p
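
For context, the second attempt in the question fails because `%>%` inserts its left-hand side as the first argument of the call on its right, so `read_html(a01) %>% for (m in cssr) {...}` calls `for` with four arguments instead of three. The first attempt hands the entire `cssr` vector to `html_nodes()` for every page rather than one selector per page. The sketch below shows the pairing idea in isolation. It is only an illustration: the inline HTML strings stand in for downloaded pages so the code runs offline, and `scrape_one` and `safe_scrape` are hypothetical helper names, not part of rvest or purrr.

```r
library(purrr)
library(rvest)

# Hypothetical stand-ins for two downloaded pages, so the sketch runs offline.
pages <- c(
  '<html><body><ul class="w-icon"><li>+966 (12) 686-0011</li><li>careers@example.com</li></ul></body></html>',
  '<html><body><section><p>Tel: +971 2-553-7571</p></section></body></html>'
)
selectors <- c(".w-icon li", "section:nth-child(1) p")

# scrape_one() is a hypothetical helper: parse one page, apply one selector.
scrape_one <- function(page, selector) {
  read_html(page) %>% html_nodes(selector) %>% html_text()
}

# possibly() wraps the helper so a page that fails to load yields NA
# instead of stopping the whole run.
safe_scrape <- possibly(scrape_one, otherwise = NA_character_)

# map2() walks both vectors in parallel: page 1 with selector 1, and so on.
results <- map2(pages, selectors, safe_scrape)
```

With real data, `file$website` and `file$cssr` would take the place of `pages` and `selectors`. Wrapping the helper in `possibly()` is a design choice rather than a requirement; it may explain nothing about the empty `text` entry in row 2 above, which could equally come from a selector that matches no nodes.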

Running rvest over a vector of web pages with different CSS selectors
