Web scraping: replacing tags manually

Asked: 2017-01-23 18:07:22

Tags: r rvest

I am working with `rvest`. Fetching the data with `read_html` works fine.

library(rvest)
# suppressMessages(library(dplyr))
library(stringr)
library(XML)

# get house data
houseurl <- "http://boekhoff.de/immobilien/gepflegtes-zweifamilienhaus-in-ellwuerden/"
house <- read_html(houseurl)
house

I am running into problems when processing the data. My questions are written as comments in the code.

## eliminating <br> tags in the address
# running the following commands causes an error in "html_nodes" later on
str_extract_all(house,"<br>") ## show all line breaks
# replace each <br> with a space " "
house <- str_replace_all(house,"<br>", " ")

Now I read the attribute labels, but this does not seem to work:

houseattribut <- house %>%
    html_nodes(css = "div.col-2 li p.data-left") %>%
    html_text(trim=TRUE)
# shows "Error in UseMethod("xml_find_all") : ... "
# but all attributes are shown on screen
houseattribut  

If I do not replace the <br> tags manually, there is no error, but `html_text` runs the strings together:

housedetails <- house %>%
    html_nodes(css = "div.col-2 li p.data-right") %>%
    html_text()
housedetails
# the same error shows "Error in UseMethod("xml_find_all") : ... "
# but all details are shown on screen

housedetails[4]
# the page source contains: "Ellwürder Straße 17<br>26954 Nordenham"
# the <br> tag should become a space

Any hints about what I am doing wrong?

1 answer:

Answer 0: (score: 0)

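The problem is that after `read_html`, `house` is an `xml_document`, but after `str_replace_all` it becomes a `chr` (a plain character vector). When you then try to select nodes again, `house` is no longer an `xml_document`, so `html_nodes` throws the error.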
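
The type change described above can be checked on a small example. This is only a sketch: `snippet` is a hypothetical stand-in for the real page so that it runs offline, and it calls `as.character()` explicitly instead of relying on the string function to coerce the document.

```r
library(rvest)   # also provides read_html()

# Hypothetical stand-in for the real page (assumption, not the actual site markup).
snippet <- "<div class='col-2'><ul><li><p class='data-right'>Ellwürder Straße 17<br>26954 Nordenham</p></li></ul></div>"

doc <- read_html(snippet)
inherits(doc, "xml_document")           # TRUE: html_nodes() can work with this

# Serialising the document and editing it as text yields a plain character vector.
html_string <- gsub("<br>", " ", as.character(doc), fixed = TRUE)
inherits(html_string, "xml_document")   # FALSE: html_nodes() has no method for this

# Parsing the edited string again gives a document that can be queried.
reparsed <- read_html(html_string)
address <- html_text(html_nodes(reparsed, "div.col-2 li p.data-right"))
address
```

Checking `class()` or `inherits()` right before a failing `html_nodes()` call is a quick way to spot this kind of problem.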

You need to convert it back into an `xml_document`, or apply the replacement node by node.

Something like:

house <- read_html(str_replace_all(house,"<br>", " "))
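
The node-by-node alternative is not spelled out in the answer. One possible sketch is shown here, assuming a hypothetical offline `snippet` in place of the real page and a hypothetical helper `text_with_spaces()`; the original document is left untouched and only the selected nodes are serialised, edited, and parsed again.

```r
library(rvest)

# Hypothetical stand-in for the real page (assumption, not the actual site markup).
snippet <- "<div class='col-2'><ul><li><p class='data-right'>Ellwürder Straße 17<br>26954 Nordenham</p></li></ul></div>"

doc <- read_html(snippet)
nodes <- html_nodes(doc, css = "div.col-2 li p.data-right")

# Hypothetical helper: serialise one node, swap <br> for a space,
# parse the fragment again and extract its text.
text_with_spaces <- function(node) {
  node_html <- gsub("<br\\s*/?>", " ", as.character(node), perl = TRUE)
  html_text(read_html(node_html), trim = TRUE)
}

details <- vapply(nodes, text_with_spaces, character(1))
details
```

This keeps `doc` an `xml_document` throughout, so further `html_nodes()` calls on it continue to work.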

Full code:

library(rvest)
#> Loading required package: xml2
library(stringr)

houseurl <- "http://boekhoff.de/immobilien/gepflegtes-zweifamilienhaus-in-ellwuerden/"
house <- read_html(houseurl)

house <- read_html(str_replace_all(house,"<br>", " "))

housedetails <- house %>%
    html_nodes(css = "div.col-2 li p.data-right") %>% 
    html_text()

housedetails[4]
#> [1] "Ellwürder Straße 17 26954 Nordenham"
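
As a further option that is not part of the answer above: newer rvest releases (1.0.0 and later, which is an assumption about the installed version) provide `html_text2()`, which turns `<br>` into a line break instead of dropping it. A minimal sketch, again with a hypothetical offline `snippet` in place of the real page:

```r
library(rvest)   # assumes rvest >= 1.0.0 for html_text2()

# Hypothetical stand-in for the real page (assumption, not the actual site markup).
snippet <- "<div class='col-2'><ul><li><p class='data-right'>Ellwürder Straße 17<br>26954 Nordenham</p></li></ul></div>"

doc <- read_html(snippet)
node <- html_nodes(doc, css = "div.col-2 li p.data-right")

# html_text2() keeps the <br> as "\n"; replace that with a space afterwards.
address_lines <- html_text2(node)
address <- gsub("\n", " ", address_lines, fixed = TRUE)
address
```

With this approach the document never has to be converted to a string and parsed again.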