Scraping a website's Power BI dashboard using R

Time: 2020-10-09 19:37:18

Tags: html r web-scraping

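I have been trying to scrape my local government's Power BI dashboard using R, but it seems like it might be impossible. I read on the Microsoft site that Power BI dashboards cannot be scraped, yet several forums I have gone through suggest it is possible, and I keep going in circles.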

I'm trying to scrape the Zip Code tab data from this dashboard:

https://app.powerbigov.us/view?r=eyJrIjoiZDFmN2ViMGEtNzQzMC00ZDU3LTkwZjUtOWU1N2RiZmJlOTYyIiwidCI6IjNiMTg1MTYzLTZjYTMtNDA2NS04NDAwLWNhNzJiM2Y3OWU2ZCJ9&pageName=ReportSectionb438b98829599a9276e2&pageName=ReportSectionb438b98829599a9276e2

I've tried several "techniques", shown in the code below:

scc_webpage <- xml2::read_html("https://app.powerbigov.us/view?r=eyJrIjoiZDFmN2ViMGEtNzQzMC00ZDU3LTkwZjUtOWU1N2RiZmJlOTYyIiwidCI6IjNiMTg1MTYzLTZjYTMtNDA2NS04NDAwLWNhNzJiM2Y3OWU2ZCJ9&pageName=ReportSectionb438b98829599a9276e2&pageName=ReportSectionb438b98829599a9276e2")


# Attempt using xpath
scc_webpage %>% 
  rvest::html_nodes(xpath = '//*[@id="pvExplorationHost"]/div/div/exploration/div/explore-canvas-modern/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container-group/transform/div/div[2]/visual-container-modern[1]/transform/div/div[3]/div/visual-modern/div/div/div[2]/div[1]/div[4]/div/div/div[1]/div[1]') %>% 
  rvest::html_text()

# Attempt using div.<class>
scc_webpage %>% 
  rvest::html_nodes("div.pivotTableCellWrap cell-interactive tablixAlignRight ") %>% 
  rvest::html_text()

# Attempt using xpathSapply
query = '//*[@id="pvExplorationHost"]/div/div/exploration/div/explore-canvas-modern/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container-group/transform/div/div[2]/visual-container-modern[1]/transform/div/div[3]/div/visual-modern/div/div/div[2]/div[1]/div[4]/div/div/div[1]/div[1]'
XML::xpathSApply(xml, query, xmlValue)

scc_webpage %>% 
  html_nodes("ui-view")

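However, I always get character(0) when using the xpath or the div class and id, or even {xml_nodeset (0)} when going through html_nodes. The weird thing is that it doesn't show the whole HTML of the table data when I do: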

scc_webpage %>% 
  html_nodes("div")

This is the output, with the chunk that I need left blank:

{xml_nodeset (2)}
[1] <div id="pbi-loading"><svg version="1.1" class="pulsing-svg-item" xmlns="http://www.w3.org/2000/svg" xmlns:xlink ...
[2] <div id="pbiAppPlaceHolder">\r\n        <ui-view></ui-view><root></root>\n</div>

My guess is that it might be because the numbers sit inside a series of nested div elements?

The main data I'm trying to get are the numbers from the table showing Zip code, confirmed cases, % total cases, deaths, and % total deaths.

If this can be done in R, or in Python using Selenium, any help would be greatly appreciated!

1 Answer:

Answer 0 (score: 1)

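The problem is that the site you want to analyze relies on JavaScript to run and fetch the content for you. In such a case, a plain HTTP fetch such as httr::GET (or xml2::read_html) is of no help, because it only returns the page as it is before any script has run.
However, since manual work is not an option either, we have Selenium.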
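
To illustrate the point without a network call, here is a minimal sketch that parses a static shell modelled on the div output shown in the question. The HTML string is a simplified assumption, not the real page source; it only mimics what a fetch sees before JavaScript has filled in the report.

```r
library(xml2)

# Simplified stand-in for the HTML a plain fetch returns (assumption:
# modelled on the two <div> nodes printed in the question).
static_shell <- '
<html><body>
  <div id="pbi-loading"></div>
  <div id="pbiAppPlaceHolder"><ui-view></ui-view><root></root></div>
</body></html>'

doc <- read_html(static_shell)

# The placeholder container is present in the static markup ...
placeholders <- xml_find_all(doc, "//div[@id='pbiAppPlaceHolder']")

# ... but no table markup is, since the browser only adds it after the
# scripts have run.
table_nodes <- xml_find_all(doc, "//div[contains(@class, 'pivotTable')]")

length(placeholders)
length(table_nodes)
```

No selector can match nodes that are not in the document yet, which is why every attempt in the question came back empty.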

The following does what you're looking for:

library(dplyr)
library(purrr)
library(readr)

library(wdman)
library(RSelenium)
library(xml2)
library(selectr) # provides querySelector() and querySelectorAll()

# using wdman to start a selenium server
selServ <- selenium(
  port = 4444L,
  version = 'latest',
  chromever = '84.0.4147.30' # set this to a chrome version that's available on your machine
)

# using RSelenium to start a chrome on the selenium server
remDr <- remoteDriver(
  remoteServerAddr = 'localhost',
  port = 4444L,
  browserName = 'chrome'
)

# open a new tab in Chrome
remDr$open()

# navigate to the site you wish to analyze
report_url <- "https://app.powerbigov.us/view?r=eyJrIjoiZDFmN2ViMGEtNzQzMC00ZDU3LTkwZjUtOWU1N2RiZmJlOTYyIiwidCI6IjNiMTg1MTYzLTZjYTMtNDA2NS04NDAwLWNhNzJiM2Y3OWU2ZCJ9&pageName=ReportSectionb438b98829599a9276e2&pageName=ReportSectionb438b98829599a9276e2"
remDr$navigate(report_url)

# crude pause so the dashboard can finish rendering; adjust as needed
Sys.sleep(10)

# find and click the button leading to the Zip Code data
zipCodeBtn <- remDr$findElement(using = "xpath", value = './/button[descendant::span[text()="Zip Code"]]')
zipCodeBtn$clickElement()

# give the Zip Code table a moment to load after the click
Sys.sleep(5)

# read the rendered page source and keep the pivot table node
zipcode_data_table <- read_html(remDr$getPageSource()[[1]]) %>%
  querySelector("div.pivotTable")
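
Because the dashboard renders asynchronously, an element lookup can fail if it runs before the page is ready. One way to handle this is to retry the lookup until it succeeds. The sketch below is a hypothetical helper (wait_for is not part of RSelenium); it assumes the lookup signals an error while the element is missing, which is how remDr$findElement behaves.

```r
# Hypothetical helper: call `try_fn` repeatedly until it returns without
# error, or stop with an error once `timeout` seconds have passed.
wait_for <- function(try_fn, timeout = 15, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    result <- tryCatch(try_fn(), error = function(e) NULL)
    if (!is.null(result)) return(result)
    if (Sys.time() >= deadline) stop("Timed out waiting for condition")
    Sys.sleep(interval)
  }
}

# Usage with the session above (not run here; it needs a live browser):
# zipCodeBtn <- wait_for(function() {
#   remDr$findElement(
#     using = "xpath",
#     value = './/button[descendant::span[text()="Zip Code"]]'
#   )
# })
```

Polling like this returns as soon as the element exists, so it tends to be both faster and more reliable than a fixed pause.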

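Now we have the rendered page source read into R, which is probably what you had in mind when you started your scraping task.
From here on it's smooth sailing: all that's left is converting that XML into a usable table: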

# column titles of the pivot table
col_headers <- zipcode_data_table %>%
  querySelectorAll("div.columnHeaders div.pivotTableCellWrap") %>%
  map_chr(xml_text)

# row labels (the zip codes); named row_names to avoid masking base::rownames()
row_names <- zipcode_data_table %>%
  querySelectorAll("div.rowHeaders div.pivotTableCellWrap") %>%
  map_chr(xml_text)

# body cells are grouped per column: collect each column's parent node once,
# then read the cell texts inside it
zipcode_data <- zipcode_data_table %>%
  querySelectorAll("div.bodyCells div.pivotTableCellWrap") %>%
  map(xml_parent) %>%
  unique() %>%
  map(~ .x %>% querySelectorAll("div.pivotTableCellWrap") %>% map_chr(xml_text)) %>%
  setNames(col_headers) %>%
  bind_cols()

# tadaa
df_final <- tibble(zipcode = row_names, zipcode_data) %>%
  type_convert(trim_ws = TRUE, na = c(""))

The resulting df looks like this:

> df_final
# A tibble: 15 x 5
   zipcode `Confirmed Cases ` `% of Total Cases ` `Deaths ` `% of Total Deaths `
   <chr>                <dbl> <chr>                   <dbl> <chr>               
 1 63301                 1549 17.53%                     40 28.99%              
 2 63366                 1364 15.44%                     38 27.54%              
 3 63303                 1160 13.13%                     21 15.22%              
 4 63385                 1091 12.35%                     12 8.70%               
 5 63304                 1046 11.84%                      3 2.17%               
 6 63368                  896 10.14%                     12 8.70%               
 7 63367                  882 9.98%                       9 6.52%               
 8                        534 6.04%                       1 0.72%               
 9 63348                  105 1.19%                       0 0.00%               
10 63341                   84 0.95%                       1 0.72%               
11 63332                   64 0.72%                       0 0.00%               
12 63373                   25 0.28%                       1 0.72%               
13 63386                   17 0.19%                       0 0.00%               
14 63357                   13 0.15%                       0 0.00%               
15 63376                    5 0.06%                       0 0.00%
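
In the printed result the percentage columns are still character vectors and the column names carry trailing spaces. A minimal base R sketch of one way to tidy that up follows; df_example is a small stand-in built from rows 1, 2 and 8 of the output above, not the scraped object itself.

```r
# Small stand-in for df_final, using values copied from the printout above.
df_example <- data.frame(
  zipcode = c("63301", "63366", ""),
  `Confirmed Cases ` = c(1549, 1364, 534),
  `% of Total Cases ` = c("17.53%", "15.44%", "6.04%"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Drop the trailing spaces carried over from the header text.
names(df_example) <- trimws(names(df_example))

# Find character columns whose values all end in "%" and convert them
# to plain numbers (17.53% becomes 17.53).
is_pct <- vapply(
  df_example,
  function(col) is.character(col) && all(grepl("%$", col)),
  logical(1)
)
df_example[is_pct] <- lapply(
  df_example[is_pct],
  function(col) as.numeric(sub("%$", "", col))
)

# Treat the blank zip code as missing.
df_example$zipcode[df_example$zipcode == ""] <- NA

str(df_example)
```

The same steps should carry over to df_final, though whether the blank zip code should become NA or a label such as "Unknown" depends on how the data will be used.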