Question

I am trying to scrape the Property24 website, but it returns extra rows of data that are not on the page. Here is my code.

library(rvest)

# Download the listing page (the URL must stay on a single line)
property <- read_html("https://www.property24.com/houses-for-sale/cape-town/western-cape/432")

# Pull the text of every node matching each class
price <- property %>% html_nodes(".p24_price") %>% html_text()
desc  <- property %>% html_nodes(".p24_excerpt") %>% html_text()
title <- property %>% html_nodes(".p24_title") %>% html_text()

# Clean up: keep only digits in the price, collapse whitespace in the description
price <- gsub("[^0-9]", "", price)
desc  <- gsub("[ \t]{2,}", "", desc)
desc  <- gsub("\r\n", "", desc)
desc  <- strtrim(desc, 100)

property_table <- data.frame(price, title, desc)

Answer 1

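The problem is that the price, title, and desc vectors have different lengths, so data.frame() cannot line them up row by row.

Why? Take a look at their contents.

You will find that some of the values do not look like proper prices or descriptions. That is because the selectors .p24_price and .p24_excerpt are not specific enough: they also match elements outside the listings. You need to look at the page source and make the selectors more specific. For example, this works better: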

price <- property %>% html_nodes(".p24_content .p24_price") %>% html_text()
desc  <- property %>% html_nodes(".p24_content .p24_excerpt") %>% html_text()
title <- property %>% html_nodes(".p24_content .p24_title") %>% html_text()
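To see why scoping the selector matters, here is a minimal sketch on made-up HTML. The markup, including the p24_promo class, is an assumption for illustration and not the real Property24 page source. It only shows how an unscoped class selector can pick up nodes outside the listings, leaving the vectors with different lengths.

```r
library(rvest)

# Hypothetical markup: one promo block plus two listings.
# This is NOT the actual Property24 page structure.
page <- read_html('
<html><body>
  <div class="p24_promo"><span class="p24_price">R 1 000 000</span></div>
  <div class="p24_content">
    <span class="p24_price">R 2 500 000</span>
    <span class="p24_title">3 Bedroom House</span>
  </div>
  <div class="p24_content">
    <span class="p24_price">R 3 750 000</span>
    <span class="p24_title">2 Bedroom Apartment</span>
  </div>
</body></html>')

# Unscoped selector: also matches the price inside the promo block
broad_price  <- page %>% html_nodes(".p24_price") %>% html_text()

# Scoped selectors: only match nodes inside a listing
scoped_price <- page %>% html_nodes(".p24_content .p24_price") %>% html_text()
title        <- page %>% html_nodes(".p24_content .p24_title") %>% html_text()

# Compare lengths before building a data frame
length(broad_price)
length(scoped_price)
length(title)
```

Checking length() on each vector before calling data.frame() is a quick way to catch this kind of mismatch on the real page.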

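But I see at least one more problem. Some properties have more than one price, for example:

From R 12 250 000 to R 13 995 999

So the way you extract the price with gsub also needs to be improved: stripping every non-digit character would run the two numbers together into a single value.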
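One possible way to handle price ranges is sketched below in base R. The helper name price_range and the sample strings (including "POA" as a listing with no price) are assumptions for illustration. The pattern assumes every price is written as "R" followed by space-separated digit groups, as in the example above.

```r
# Hypothetical helper: returns the lowest and highest price found in each string.
price_range <- function(x) {
  # Find every "R <digits and spaces>" chunk in each string
  matches <- regmatches(x, gregexpr("R\\s*[0-9][0-9 ]*", x))
  result <- vapply(matches, function(m) {
    # Drop the "R" and the spaces, then convert to a number
    values <- as.numeric(gsub("[^0-9]", "", m))
    if (length(values) == 0) {
      c(min = NA_real_, max = NA_real_)  # no price in this string
    } else {
      c(min = min(values), max = max(values))
    }
  }, FUN.VALUE = c(min = 0, max = 0))
  t(result)  # one row per input string, columns "min" and "max"
}

# Sample strings: a single price, a range, and a listing without a price
sample_prices <- c("R 2 500 000", "From R 12 250 000 to R 13 995 999", "POA")
ranges <- price_range(sample_prices)
ranges
```

Keeping the minimum and maximum as separate columns avoids losing information for listings with a range. For single-price listings both columns hold the same value.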
