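When I scrape data from a web page, some elements/values are not returned.
Specifically, I am using the rvest package to scrape.
The page with the information I want is https://azure.microsoft.com/en-us/pricing/details/virtual-machines/windows/ - however, when I scrape it, the prices come back only as "$-".
Sample code: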
library(rvest)
# Download and parse the pricing page
webpage <- read_html("https://azure.microsoft.com/en-us/pricing/details/virtual-machines/windows/")
tbls <- html_nodes(webpage, "table")
# Convert every table except the last two into a data frame
tbls_ls <- webpage %>%
  html_nodes("table") %>%
  .[1:(length(tbls)-2)] %>%
  html_table()
Output for the first df:
> List of 22
>  $ :'data.frame': 7 obs. of 6 variables:
>   ..$ Instance : chr [1:7] "B1L" "B1S" "B2S" "B1MS" ...
>   ..$ Cores : int [1:7] 1 1 2 1 2 4 8
>   ..$ RAM : chr [1:7] "0.50 GiB" "1.00 GiB" "4.00 GiB" "2.00 GiB" ...
>   ..$ Temporary Storage : chr [1:7] "1 GiB" "2 GiB" "8 GiB" "4 GiB" ...
>   ..$ Price : chr [1:7] "$-" "$-" "$-" "$-" ...
>   ..$ Prices with Azure Hybrid Benefit1 (% savings): chr [1:7] "$-" "$-" "$-" "$-" ...
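What can I do to get the full values of these specific elements?
Answer 0: (score: 0)
Each price cell carries the full set of price data regardless of which filter is selected on the page. So you need to read the value of that attribute (data-amount on each span.price-data) and parse the JSON.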
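library(rvest)
webpage <- read_html("https://azure.microsoft.com/en-us/pricing/details/virtual-machines/windows/")
tbls <- html_nodes(webpage, "table")
webpage %>%
  html_nodes("table") %>%
  .[1:(length(tbls)-2)] %>%
  html_table()
# Read the JSON stored in the data-amount attribute of each price span
# (html_attr is rvest's own accessor, so xml2 does not need to be attached)
ss <- webpage %>% html_nodes("table span.price-data") %>% html_attr("data-amount")
# Parse each JSON string into a one-row data frame of regional prices
lapply(ss, function(x) data.frame(jsonlite::fromJSON(x)))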
Sample output:
[[176]]
regional.asia.pacific.southeast regional.australia.east regional.canada.central regional.canada.east
1 1.496 1.496 1.376 1.376
regional.europe.west regional.japan.east regional.united.kingdom.south regional.us.east.2 regional.usgov.virginia
1 1.488 1.464 1.448 1.373 1.504
regional.us.west regional.us.west.2
1 1.376 1.248
[[177]]
regional.asia.pacific.southeast regional.australia.east regional.canada.central regional.canada.east
1 4.464 4.464 4.224 4.224
regional.europe.west regional.japan.east regional.united.kingdom.south regional.us.east.2 regional.usgov.virginia
1 4.448 4.4 4.368 4.365 4.48
regional.us.west regional.us.west.2
1 4.224 3.968
You then need to match the specific value you want (for example, your region) and take the price from it.
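As a rough sketch of that last step, the snippet below runs the same attribute-then-JSON approach on a hand-written HTML fragment instead of the live page, then picks out one region. The fragment, the hyphenated region keys (guessed from the column names in the sample output) and the helper price_for_region are assumptions made for illustration; the three amounts are taken from element [[176]] of the sample output.

```r
library(rvest)
library(jsonlite)

# Hand-written stand-in for a single price cell (an assumption, not the real
# page markup). The visible text is the "$-" placeholder; the prices sit in
# the data-amount attribute as JSON.
cell <- read_html(paste0(
  '<table><tr><td><span class="price-data" data-amount=\'',
  '{"regional":{"europe-west":1.488,"us-west":1.376,"us-west-2":1.248}}',
  '\'>$-</span></td></tr></table>'
))

spans <- html_nodes(cell, "table span.price-data")
placeholder <- html_text(spans)             # what html_table() would see
amounts <- html_attr(spans, "data-amount")  # the JSON with the real prices

# Hypothetical helper: price for one region, or NA when the attribute is
# missing or the region key is not present in the JSON.
price_for_region <- function(json, region) {
  if (is.na(json)) return(NA_real_)
  value <- fromJSON(json)$regional[[region]]
  if (is.null(value)) NA_real_ else as.numeric(value)
}

price_for_region(amounts[[1]], "us-west-2")
```

On the real page the same helper could be mapped over every attribute value, for instance with vapply(ss, price_for_region, numeric(1), region = "us-west-2", USE.NAMES = FALSE), provided the region key matches what the page's JSON actually uses.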