I am working with "rvest". Fetching the data with "read_html" works fine.
library(rvest)
# suppressMessages(library(dplyr))
library(stringr)
library(XML)
# get house data
houseurl <- "http://boekhoff.de/immobilien/gepflegtes-zweifamilienhaus-in-ellwuerden/"
house <- read_html(houseurl)
house
I run into problems when processing the data. My questions are written as comments in the source.
## eliminating <br> tags in the address
# using the following commands causes an error in "html_nodes" later on
str_extract_all(house,"<br>") ## show all line breaks
# replacing <br> with a space " "
house <- str_replace_all(house,"<br>", " ")
Now I read the details, but it looks like this does not work:
houseattribut <- house %>%
html_nodes(css = "div.col-2 li p.data-left") %>%
html_text(trim=TRUE)
# shows "Error in UseMethod("xml_find_all") : ... "
# but all attributes are shown on screen
houseattribut
Without replacing the <br> tags manually, "html_text" squeezes the strings together:
housedetails <- house %>%
html_nodes(css = "div.col-2 li p.data-right") %>%
html_text()
housedetails
# the same error shows "Error in UseMethod("xml_find_all") : ... "
# but all details are shown on screen
housedetails[4]
# in the source there is: "Ellwürder Straße 17<br>26954 Nordenham"
# the <br> tag should become a space
Any hints about what I am doing wrong?
Answer 0 (score: 0)
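The problem is that after read_html, house is an xml_document, but once you run str_replace_all on it, it becomes a character vector (chr). When you then try to select nodes again, it is no longer an xml_document, so html_nodes fails with that error.
You need to either convert it back into an xml_document or apply the replacement node by node.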
To convert it back, something like:
# as.character() makes the conversion to text explicit before the replacement
house <- read_html(str_replace_all(as.character(house),"<br>", " "))
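To make the type change visible, here is a minimal sketch that needs no network access. The HTML fragment and the variable names (html, doc, txt, doc2) are made up for illustration; only the p.data-right class is borrowed from the question.

```r
library(rvest)
library(stringr)

# Hypothetical stand-in for the downloaded page (not the real site)
html <- "<html><body><p class='data-right'>Main Street 17<br>26954 Nordenham</p></body></html>"

doc <- read_html(html)
class_before <- class(doc)   # an xml_document, so html_nodes() works on it

# Replacing text forces a conversion to a plain character string
txt <- str_replace_all(as.character(doc), "<br>", " ")
class_after <- class(txt)    # "character", so html_nodes() has no method for it

# Parsing the string again gives back an xml_document
doc2 <- read_html(txt)
address <- doc2 %>%
  html_nodes("p.data-right") %>%
  html_text()
address
```

Comparing class_before and class_after shows why the second html_nodes call in the question fails: the selector functions only have methods for parsed documents and nodes, not for character strings.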
Full code:
library(rvest)
#> Loading required package: xml2
library(stringr)
houseurl <- "http://boekhoff.de/immobilien/gepflegtes-zweifamilienhaus-in-ellwuerden/"
house <- read_html(houseurl)
house <- read_html(str_replace_all(as.character(house),"<br>", " "))
housedetails <- house %>%
html_nodes(css = "div.col-2 li p.data-right") %>%
html_text()
housedetails[4]
#> [1] "Ellwürder Straße 17 26954 Nordenham"
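The other option mentioned above, applying the replacement per node, could look roughly like the sketch below. It uses xml2 to put a space next to each <br> and then remove the <br>, so the document never leaves its parsed form. The HTML fragment is a made-up stand-in for the real page, and using a span element to carry the space is just one possible choice, not the only way to do it.

```r
library(rvest)
library(xml2)

# Hypothetical stand-in for the downloaded page (not the real site)
html <- "<html><body><p class='data-right'>Main Street 17<br>26954 Nordenham</p></body></html>"
doc <- read_html(html)

# Find every <br>, add a sibling element holding a single space after it,
# then drop the <br> itself. xml2 modifies the document in place.
br_nodes <- xml_find_all(doc, ".//br")
xml_add_sibling(br_nodes, "span", " ")
xml_remove(br_nodes)

address <- doc %>%
  html_nodes("p.data-right") %>%
  html_text()
address
```

This avoids serializing and re-parsing the whole page. If your rvest is version 1.0.0 or newer, html_text2() is also worth a look: it is documented to turn <br> into line breaks, which may make the manual replacement unnecessary.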