I am trying to scrape contact details from different websites, so each site needs a different CSS selector.
Data:
  website                                status  email                    phone             Fax               cssr
1 http://www.saudiacatering.com/en/home  NA      info@noorinvestment.com  +966 12-686-0011  +966 12-686-1864  .w-icon li
2 http://www.laithllc.com/contact.html   NA      info@laithllc.com        +971 2-553-7571   +971 2-353-7579   p+ p , section:nth-child(1) p
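I have read every post I could find on scraping multiple web pages, but they all deal with pages that share similar URLs and similar CSS selectors / XPath.
This is what I tried: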
library(rvest)
i<- str_replace_all(file$website, "http://www.[.]+", "")
urls<- "http://www."
cssr<- as.vector(file$cssr)
for (i in urls){
a01 <- paste0("http://www.",i, sep="")
text <- read_html(a01) %>%
html_nodes(cssr) %>%
html_text()
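The idea was to treat http://www. as the base URL and append the rest of each website link to it, but this did not work.
Has anyone done something similar, and am I using the right package?
New code: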
library(stringr)
library(rvest)
library(magrittr)
i<- str_replace_all(url, "http://www.", "")
urls<- "http://www."
cssr<- as.vector(file$cssr)
for (x in i){
a01 <- paste0("http://www.",x, sep="")
read_html(a01)%>%
for(m in cssr){html_nodes(m) %>%html_text()}}
Error in for (. in m) file$cssr :
4 arguments passed to 'for' which requires 3
Answer 0 (score: 1)
Taking @Spacedman's comment into account, maybe this is what you want:
file <- read.table(header = TRUE, stringsAsFactors = FALSE, text =
'website status email phone Fax cssr
http://www.saudiacatering.com/en/home NA info@noorinvestment.com "+966 12-686-0011" "+966 12-686-1864" ".w-icon li"
http://www.laithllc.com/contact.html NA info@laithllc.com "+971 2-553-7571" "+971 2-353-7579" "p+ p , section:nth-child(1) p"')
library(dplyr)
library(purrr)
library(rvest)
mutate(file, text = map2(website, cssr, ~ read_html(.x) %>% html_nodes(.y) %>% html_text()))
# website status email phone Fax cssr text
# 1 http://www.saudiacatering.com/en/home NA info@noorinvestment.com +966 12-686-0011 +966 12-686-1864 .w-icon li +966 (12) 686-0011, +966 (12) 686-1864, careers@saudiacatering.com
# 2 http://www.laithllc.com/contact.html NA info@laithllc.com +971 2-553-7571 +971 2-353-7579 p+ p , section:nth-child(1) p
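If one of the sites is unreachable or a selector is malformed, the whole map2() call stops with an error. A minimal sketch of one way to guard against that is to wrap the scraping step in purrr::possibly() so a failing row yields NA instead. To keep the example runnable offline, the two "pages" below are made-up inline HTML strings standing in for the real sites, and scrape_text / safe_scrape are hypothetical helper names, not part of rvest:

```r
library(purrr)
library(rvest)

# Hypothetical stand-ins for the downloaded pages (read_html() also accepts
# literal HTML, so no network access is needed for this sketch).
pages <- c(
  '<html><body><ul class="w-icon"><li>+966 (12) 686-0011</li><li>careers@example.com</li></ul></body></html>',
  '<html><body><section><p>Office</p><p>+971 2-553-7571</p></section></body></html>'
)
# One CSS selector per page, in the same order as `pages`.
selectors <- c(".w-icon li", "p + p")

# Read one page and extract the text of the nodes matching its own selector.
scrape_text <- function(page, css) {
  read_html(page) %>%
    html_nodes(css) %>%
    html_text()
}

# Same function, but returning NA instead of stopping when anything fails.
safe_scrape <- possibly(scrape_text, otherwise = NA_character_)

# Walk pages and selectors in parallel, exactly like map2(website, cssr, ...).
result <- map2(pages, selectors, safe_scrape)
print(result)
```

With the real data you would pass file$website and file$cssr in place of pages and selectors. As for the error in the question, it appears to come from piping into for: the pipe inserts its left-hand side as an extra argument to the for call, which is why R complains about 4 arguments instead of 3.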