I am trying to scrape the list of hospitals, with their addresses and phone numbers, from Catholic Health Initiatives.
The code I am using is:
# install.packages('rvest')
library('rvest')
htmlpage <- read_html("http://www.catholichealthinitiatives.org/landing.cfm?xyzpdqabc=0&id=39524&action=list")
chihtml <- html_nodes(htmlpage,".info , .address")
chi <- html_text(chihtml)
chi
library(stringr)
chi <- str_replace_all(chi, "[\r\n\t]" , "")
chi
Here is the head of the result:
[1] "CHI St. VincentTwo St. Vincent Cr.Little Rock, AR 72205P 501.552.3000F 501.552.4241"
[2] "Two St. Vincent Cr.Little Rock, AR 72205P 501.552.3000F 501.552.4241"
[3] "CHI St. Vincent Hot Springs300 Werner StreetHot Springs National Park, AR 71913P 501.622.1000"
[4] "300 Werner StreetHot Springs National Park, AR 71913P 501.622.1000"
[5] "CHI St. Vincent InfirmaryTwo St. Vincent CircleLittle Rock, AR 72205P 502.552.3000F 501.552.4241"
[6] "Two St. Vincent CircleLittle Rock, AR 72205P 502.552.3000F 501.552.4241"
I want to remove the duplicate address line that follows each main line. For example:
[1] "CHI St. VincentTwo St. Vincent Cr.Little Rock, AR 72205P 501.552.3000F 501.552.4241"
## remove next line ##
[2] "Two St. Vincent Cr.Little Rock, AR 72205P 501.552.3000F 501.552.4241"
Answer 0 (score: 0)
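The selector ".info , .address" matches both the .info blocks and the .address elements, which is why each address shows up twice. Just specify either .info or .address in html_nodes, depending on which one you want: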
chihtml <- html_nodes(htmlpage,".info")
chi <- html_text(chihtml, trim = TRUE) # `trim = TRUE` to strip whitespace
head(chi)
# [1] "CHI St. Vincent\nTwo St. Vincent Cr.Little Rock, AR 72205P 501.552.3000F 501.552.4241"
# [2] "CHI St. Vincent Hot Springs\n300 Werner StreetHot Springs National Park, AR 71913P 501.622.1000"
# [3] "CHI St. Vincent Infirmary\nTwo St. Vincent CircleLittle Rock, AR 72205P 502.552.3000F 501.552.4241"
# [4] "CHI St. Vincent Morrilton\nFour Hospital DriveMorrilton, AR 72110P 501.977.2300F 501.977.2400"
# [5] "CHI St. Vincent North\n2215 Wildwood AvenueSherwood, AR 72120P 501.977.2300F 501.977.2400"
# [6] "CHI St. Vincent Rehabilitation Hospital\n2201 Wildwood AvenueSherwood, AR 72120P 501.834.1800F 501.834.2227"
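
If you also want the hospital name separated from the rest, one possible follow-up is to split each string on the newline that `trim = TRUE` leaves between the name and the address. This is only a sketch: it uses two values copied from the output above rather than a live scrape, the names `parts` and `hospitals` are made up for illustration, and the phone pattern assumes every number appears as "P " followed by the ddd.ddd.dddd form shown in these samples.

```r
library(stringr)

# Two sample values copied from the output above (not scraped live)
chi <- c(
  "CHI St. Vincent\nTwo St. Vincent Cr.Little Rock, AR 72205P 501.552.3000F 501.552.4241",
  "CHI St. Vincent Hot Springs\n300 Werner StreetHot Springs National Park, AR 71913P 501.622.1000"
)

# Split each string at the first newline: column 1 is the name, column 2 the rest
parts <- str_split_fixed(chi, "\n", n = 2)

# Capture the number that follows "P " (assumed format: ddd.ddd.dddd)
phone <- str_match(parts[, 2], "P (\\d{3}\\.\\d{3}\\.\\d{4})")[, 2]

hospitals <- data.frame(
  name    = parts[, 1],
  details = parts[, 2],
  phone   = phone,
  stringsAsFactors = FALSE
)
hospitals$name
# [1] "CHI St. Vincent"             "CHI St. Vincent Hot Springs"
```

Splitting the street from the city is harder, because the page text runs them together with no separator ("Two St. Vincent Cr.Little Rock"); selecting the individual child nodes inside .address would probably be more reliable than a regex for that part.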