Question

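Is there a way to extract data from a vCard using R? I am scraping several websites, one of which is https://www.cwlaw.com/attorneys.

I need to collect information (the email address) from the vCards.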

Answer 1

This extracts the href attributes that contain "mailto:" and uses gsub to strip that prefix.

 library(rvest)
 gsub("mailto:", "", grep("mailto:", read_html("https://www.cwlaw.com/attorneys") %>% html_nodes("a") %>% html_attr("href"), value = TRUE))
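
The filtering step can be tried offline on a plain character vector. The following is a minimal sketch in base R; the `hrefs` values are made up for illustration and are not taken from the site. It also assumes that some mailto links may carry a query string (such as `?subject=`) or appear more than once, which the one-liner above would leave in place.

```r
# Hypothetical href values, standing in for the output of html_attr("href")
hrefs <- c(
  "/attorneys/jane-doe",
  "mailto:jane.doe@example.com",
  "mailto:john.roe@example.com?subject=Hello",
  "mailto:jane.doe@example.com",
  NA
)

# Keep only the links that start with the mailto: scheme (NA never matches)
mail_links <- grep("^mailto:", hrefs, value = TRUE)

# Strip the scheme, then any query string, then drop duplicates
emails <- sub("^mailto:", "", mail_links)
emails <- sub("\\?.*$", "", emails)
emails <- unique(emails)

print(emails)
```

Anchoring the pattern with `^` avoids matching a "mailto:" that happens to appear in the middle of some other URL.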

Answer 2

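Here is a simple way to extract the email address from a vCard.

This approach uses curl to download the card, grep to find the line containing the string EMAIL, and finally stringr::str_split to capture the relevant part of that line.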

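library(curl)
library(stringr)

# Create a connection to the vCard file; readLines() opens it and reads every line
con <- curl('https://www.cwlaw.com/vcard-82.vcf', open = '')
card <- readLines(con)
# Keep the EMAIL line and take the text after the 'CP1252:' charset marker
str_split(grep('EMAIL', card, value = TRUE), 'CP1252:')[[1]][2]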
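
Splitting on 'CP1252:' only works when the card declares that exact charset. As a hedged alternative, the sketch below takes everything after the first colon on each EMAIL line, which is where a vCard property stores its value. The sample `card` and the helper name `extract_vcard_email` are hypothetical; the real file's contents are not shown here.

```r
# Hypothetical card contents, standing in for the result of readLines(con)
card <- c(
  "BEGIN:VCARD",
  "VERSION:3.0",
  "FN:Jane Doe",
  "EMAIL;TYPE=INTERNET;CHARSET=CP1252:jane.doe@example.com",
  "END:VCARD"
)

# Illustrative helper (not part of any package): returns the value of every
# line whose property name is EMAIL, whatever parameters follow the name
extract_vcard_email <- function(lines) {
  email_lines <- grep("^EMAIL[;:]", lines, value = TRUE, ignore.case = TRUE)
  # Remove the property name and its parameters, i.e. everything up to the first colon
  trimws(sub("^[^:]*:", "", email_lines))
}

print(extract_vcard_email(card))
```

This sketch does not handle grouped properties (for example a line beginning with `item1.EMAIL`) or folded lines, so it may need adjusting for cards that use them.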

Extracting vCard information - R scraping
