Question

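For an academic project, I am trying to scrape data from a real-estate portal. The data I'm interested in is the price trend, which sits inside an iframe. I want to get the upper-range, average, and lower-range values. The data is stored in input tags. I tried to reach it by selecting the parent class and then the input tag, but I can't get the data.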

I need to scrape many iframes; one of them is the one whose URL appears in the code below.

The code I tried is below, but it does not give the result I want.

library(rvest)

# Specify the URL of the iframe to be scraped
url <- 'https://www.99acres.com/do/pricetrends?building_id=0&loc_id=12400&prop_type=1&pref=S&bed_no=0&w=600&h=350'

# Download the page and read the HTML from the saved file
download.file(url, destfile = "scrapedpage.html", quiet = TRUE)
webpage <- read_html("scrapedpage.html")

# Select the input tags under the .ptplay parent class
PriceTrend_data_html <- html_nodes(webpage, '.ptplay input')

PriceTrend_data_html

If anyone could point me in the right direction, it would be a great help.

Answer 1

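After some research I was able to solve this myself, so I'm posting it here in case anyone else runs into the same problem. When I downloaded the file with download.file(), I could not read the resulting HTML file with read_html(), so I had to download the file manually and then process it.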

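Since the data is only inside the input tag, I selected the input tag by its id, read its attribute, and got the data I needed. This is the code that worked for me.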

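library(rvest)

# Read the manually saved iframe page
webpage <- read_html("scrapedpage_chart.html")
# Pull the comma-separated "median" attribute from the input with id priceTrendVariables
average_prices <- html_attr(html_nodes(webpage, "#priceTrendVariables"), "median")
# Replace 'null' entries so they convert to NA
average_prices <- gsub(pattern = 'null', replacement = 'NA', x = average_prices)
average_prices <- unlist(strsplit(average_prices, split = ","))
average_prices <- as.numeric(average_prices)
average_prices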
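
For anyone who wants to try the attribute-extraction step without the saved page, here is a minimal, self-contained sketch. The inline HTML and its values are made up for illustration, and the helper name parse_series is hypothetical. Only the id priceTrendVariables and the attribute name median come from the code above; the real page may be structured differently.

```r
library(rvest)

# Hypothetical stand-in for the saved iframe page; the values are invented.
sample_html <- '<html><body>
  <input type="hidden" id="priceTrendVariables" median="null,4500,4650,null,4800">
</body></html>'

# Hypothetical helper: split a comma-separated attribute value into a
# numeric vector, turning "null" entries into NA.
parse_series <- function(x) {
  parts <- trimws(unlist(strsplit(x, ",", fixed = TRUE)))
  parts[parts %in% "null"] <- NA_character_
  as.numeric(parts)
}

page <- read_html(sample_html)
node <- html_node(page, "#priceTrendVariables")
median_prices <- parse_series(html_attr(node, "median"))
median_prices
```

The same helper could probably be reused for the upper and lower ranges by passing a different attribute name to html_attr(), but those attribute names would have to be checked in the page source first; they are not shown here.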
