R: Missing values when scraping a web page

Time: 2017-11-06 14:34:09

Tags: html r dataframe web-scraping missing-data

When I scrape data from a web page, some elements/values are not returned.

Specifically, I am using the rvest package for the scraping.

The page with the information I want is https://azure.microsoft.com/en-us/pricing/details/virtual-machines/windows/ - however, when I scrape the data, the prices only come back as "$-".

Sample code:

library(rvest)

webpage <- read_html("https://azure.microsoft.com/en-us/pricing/details/virtual-machines/windows/")
tbls <- html_nodes(webpage, "table")

tbls_ls <- webpage %>%
  html_nodes("table") %>%
  .[1:(length(tbls)-2)] %>%
  html_table()

Output for the first data frame:

> List of 22
>  $ :'data.frame':  7 obs. of  6 variables:
>   ..$ Instance                                      : chr [1:7] "B1L" "B1S" "B2S" "B1MS" ...
>   ..$ Cores                                         : int [1:7] 1 1 2 1 2 4 8
>   ..$ RAM                                           : chr [1:7] "0.50 GiB" "1.00 GiB" "4.00 GiB" "2.00 GiB" ...
>   ..$ Temporary Storage                             : chr [1:7] "1 GiB" "2 GiB" "8 GiB" "4 GiB" ...
>   ..$ Price                                         : chr [1:7] "$-" "$-" "$-" "$-" ...
>   ..$ Prices with Azure Hybrid Benefit1 (% savings): chr [1:7] "$-" "$-" "$-" "$-" ...

What can I do to get the full values of these specific elements?

1 Answer:

Answer 0: (score: 0)

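Regardless of which filter is selected, each price element carries the full set of price data in its data-amount attribute. So you need to read that attribute's value and parse the JSON.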

library(rvest)

webpage <- read_html("https://azure.microsoft.com/en-us/pricing/details/virtual-machines/windows/")
tbls <- html_nodes(webpage, "table")

webpage %>%
  html_nodes("table") %>%
  .[1:(length(tbls)-2)] %>%
  html_table()


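# Each price span keeps its prices as a JSON string in data-amount
ss <- webpage %>% html_nodes("table span.price-data") %>% html_attr("data-amount")

# Parse each JSON string into a one-row data frame
lapply(ss, function(x) data.frame(jsonlite::fromJSON(x)))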

Sample output:

[[176]]
  regional.asia.pacific.southeast regional.australia.east regional.canada.central regional.canada.east
1                           1.496                   1.496                   1.376                1.376
  regional.europe.west regional.japan.east regional.united.kingdom.south regional.us.east.2 regional.usgov.virginia
1                1.488               1.464                         1.448              1.373                   1.504
  regional.us.west regional.us.west.2
1            1.376              1.248

[[177]]
  regional.asia.pacific.southeast regional.australia.east regional.canada.central regional.canada.east
1                           4.464                   4.464                   4.224                4.224
  regional.europe.west regional.japan.east regional.united.kingdom.south regional.us.east.2 regional.usgov.virginia
1                4.448                 4.4                         4.368              4.365                    4.48
  regional.us.west regional.us.west.2
1            4.224              3.968

You then need to match the specific value you are after and take the price from it.
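As a rough illustration of that last step, here is a minimal sketch that runs on a small inline HTML fragment instead of the live page. The fragment, the instance names, the region key "us-east-2" and the nested "regional" JSON layout are assumptions inferred from the column names in the sample output above, not something confirmed against the actual page.

```r
library(rvest)
library(jsonlite)

# Hypothetical stand-in for the pricing page: the visible text of each price
# span is the "$-" placeholder, while data-amount holds the per-region JSON.
html <- '<html><body><table>
  <tr><th>Instance</th><th>Price</th></tr>
  <tr><td>A</td><td><span class="price-data" data-amount=\'{"regional":{"us-east-2":1.373,"us-west":1.376}}\'>$-</span></td></tr>
  <tr><td>B</td><td><span class="price-data" data-amount=\'{"regional":{"us-east-2":4.365,"us-west":4.224}}\'>$-</span></td></tr>
</table></body></html>'

page  <- read_html(html)
spans <- html_nodes(page, "table span.price-data")

# What html_table() sees: only the placeholder text
placeholders <- html_text(spans)

# What the attribute holds: one JSON string per price span
amounts <- html_attr(spans, "data-amount")

# Assumed region key; pick whichever key appears in the parsed JSON
region <- "us-east-2"

# Parse each JSON string and pull out the price for the chosen region,
# returning NA when that region is missing for a given span
prices <- vapply(amounts, function(x) {
  value <- fromJSON(x)$regional[[region]]
  if (is.null(value)) NA_real_ else as.numeric(value)
}, numeric(1), USE.NAMES = FALSE)

prices
```

On the real page each table row appears to contain more than one price span (Price and the Azure Hybrid Benefit price), so lining the extracted values up with the rows from html_table() would need extra care, for example by extracting the spans row by row.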