Question

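Hello, I am writing because I am trying to find a way to scrape data from a web page ("https://nabtu.org/about-nabtu/official-directory/building-trades-local-councils-overview/"). I am doing this for practice, just to learn how to scrape data. I am trying to scrape the contact details (Office, Fax, Email) from that page, but I can't, because there is no definite CSS path that I can pick up with SelectorGadget. I am using R, and my script looks like this.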

library(rvest)
page_name <- read_html("page html")

page_name %>%
  html_node("selector gadget node") %>%
  html_text()

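I have scraped all the other data; I just can't scrape this contact information. Any help would be greatly appreciated, because I am banging my head against the wall. Thank you.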

Answer 1

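I'm not sure where the problem is. Each contact block has the class .council-list. Using that, you can extract the contact information for each block separately. Then use some string/regex manipulation to pull out the exact fields.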

library(rvest)
library(stringr)

# Read the page and grab the text of every contact block
page_name <- read_html('https://nabtu.org/about-nabtu/official-directory/building-trades-local-councils-overview/')
contact_strings <- page_name %>%
  html_nodes('.council-list') %>%
  html_text()

# Filter out strings that don't contain contact information
contact_strings <- grep('Email|Fax|Office', contact_strings, ignore.case = TRUE, value = TRUE)

# Extract information: each number runs up to the next letter, the email up to the next space
office <- str_extract(contact_strings, 'Office:[^[:alpha:]]*')
fax <- str_extract(contact_strings, 'Fax:[^[:alpha:]]*')
email <- str_extract(contact_strings, 'Email: [^ ]*')
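
The extracted values still carry their labels ("Office:", "Fax:", "Email:") and trailing whitespace. As a rough sketch of one way to tidy them up, the snippet below applies the same patterns to two made-up sample strings, strips the labels, and collects the fields in a data frame. The sample strings and the clean_field helper are illustrative assumptions, not taken from the page or from the answer; the real html_text() output may be laid out differently.

```r
library(stringr)

# Hypothetical sample strings standing in for the html_text() output
contact_strings <- c(
  "Example Council Office: (555) 123-4567 Fax: (555) 123-4568 Email: info@example.org",
  "Another Council Office: 555-987-6543 Email: contact@example.org"
)

# Hypothetical helper: extract a field with the given pattern,
# then drop its leading label and any surrounding whitespace
clean_field <- function(x, pattern, label) {
  str_trim(str_remove(str_extract(x, pattern), paste0("^", label)))
}

office <- clean_field(contact_strings, "Office:[^[:alpha:]]*", "Office:")
fax    <- clean_field(contact_strings, "Fax:[^[:alpha:]]*", "Fax:")
email  <- clean_field(contact_strings, "Email: [^ ]*", "Email:")

# One row per contact block; fields missing from a block come out as NA
contacts <- data.frame(office, fax, email, stringsAsFactors = FALSE)
print(contacts)
```

Because str_extract returns NA when a pattern does not match, a block without a fax number simply gets NA in the fax column instead of breaking the row alignment.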

Scraping text data in R without a CSS path