我一直在尝试使用R刮擦我当地政府的Power BI仪表板,但似乎不可能。我从Microsoft网站上了解到,无法对Power BI仪表板进行加密,但是我正在浏览多个论坛,以表明有可能,但是我正在经历一个循环
我正在尝试从此信息中心抓取Zip Code
标签数据:
我从下面的给定代码中尝试了几种“技术”
scc_webpage <- xml2::read_html("https://app.powerbigov.us/view?r=eyJrIjoiZDFmN2ViMGEtNzQzMC00ZDU3LTkwZjUtOWU1N2RiZmJlOTYyIiwidCI6IjNiMTg1MTYzLTZjYTMtNDA2NS04NDAwLWNhNzJiM2Y3OWU2ZCJ9&pageName=ReportSectionb438b98829599a9276e2&pageName=ReportSectionb438b98829599a9276e2")
# Attempt using xpath
scc_webpage %>%
rvest::html_nodes(xpath = '//*[@id="pvExplorationHost"]/div/div/exploration/div/explore-canvas-modern/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container-group/transform/div/div[2]/visual-container-modern[1]/transform/div/div[3]/div/visual-modern/div/div/div[2]/div[1]/div[4]/div/div/div[1]/div[1]') %>%
rvest::html_text()
# Attempt using div.<class>
scc_webpage %>%
rvest::html_nodes("div.pivotTableCellWrap cell-interactive tablixAlignRight ") %>%
rvest::html_text()
# Attempt using xpathSapply
query = '//*[@id="pvExplorationHost"]/div/div/exploration/div/explore-canvas-modern/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container-group/transform/div/div[2]/visual-container-modern[1]/transform/div/div[3]/div/visual-modern/div/div/div[2]/div[1]/div[4]/div/div/div[1]/div[1]'
XML::xpathSApply(xml, query, xmlValue)
scc_webpage %>%
html_nodes("ui-view")
但是,在使用xpath并获取character(0)
类和id时,我总是得到一个输出div
,或者在尝试通过{xml_nodeset (0)}
时甚至得到html_nodes
。奇怪的是,当我这样做时,它不会显示表格数据的整个html:
scc_webpage %>%
html_nodes("div")
这将是输出,将我需要的块留空:
{xml_nodeset (2)}
[1] <div id="pbi-loading"><svg version="1.1" class="pulsing-svg-item" xmlns="http://www.w3.org/2000/svg" xmlns:xlink ...
[2] <div id="pbiAppPlaceHolder">\r\n <ui-view></ui-view><root></root>\n</div>
我猜可能是因为数字在一系列嵌套的div
属性内?
我要获取的主要数据是表格中显示Zip code
,confirmed cases
,% total cases
,deaths
,% total deaths
的数字。 / p>
如果这可以在R或使用Selenium的Python中完成,那么对此的任何帮助将不胜感激!
答案 0 :(得分:1)
问题是您要分析的站点依赖JavaScript运行并为您获取内容。在这种情况下,httr::GET
对您没有帮助。
但是,由于也不是手动工作,因此我们提供了硒。
以下内容可满足您的需求:
library(dplyr)
library(purrr)
library(readr)
library(wdman)
library(RSelenium)
library(xml2)
# using wdman to start a selenium server
selServ <- selenium(
port = 4444L,
version = 'latest',
chromever = '84.0.4147.30', # set this to a chrome version that's available on your machine
)
# using RSelenium to start a chrome on the selenium server
remDr <- remoteDriver(
remoteServerAddr = 'localhost',
port = 4444L,
browserName = 'chrome'
)
# open a new Tag on Chrome
remDr$open()
# navigate to the site you wish to analyze
report_url <- "https://app.powerbigov.us/view?r=eyJrIjoiZDFmN2ViMGEtNzQzMC00ZDU3LTkwZjUtOWU1N2RiZmJlOTYyIiwidCI6IjNiMTg1MTYzLTZjYTMtNDA2NS04NDAwLWNhNzJiM2Y3OWU2ZCJ9&pageName=ReportSectionb438b98829599a9276e2&pageName=ReportSectionb438b98829599a9276e2"
remDr$navigate(report_url)
# find and click the button leading to the Zip Code data
zipCodeBtn <- remDr$findElement('.//button[descendant::span[text()="Zip Code"]]', using="xpath")
zipCodeBtn$clickElement()
# fetch the site source in XML
zipcode_data_table <- read_html(remDr$getPageSource()[[1]]) %>%
querySelector("div.pivotTable")
现在,我们已将页面源读入R,这可能是您开始抓取任务时所想到的。
从这里开始,一切顺利,仅涉及将xml转换为可用表:
col_headers <- zipcode_data_table %>%
querySelectorAll("div.columnHeaders div.pivotTableCellWrap") %>%
map_chr(xml_text)
rownames <- zipcode_data_table %>%
querySelectorAll("div.rowHeaders div.pivotTableCellWrap") %>%
map_chr(xml_text)
zipcode_data <- zipcode_data_table %>%
querySelectorAll("div.bodyCells div.pivotTableCellWrap") %>%
map(xml_parent) %>%
unique() %>%
map(~ .x %>% querySelectorAll("div.pivotTableCellWrap") %>% map_chr(xml_text)) %>%
setNames(col_headers) %>%
bind_cols()
# tadaa
df_final <- tibble(zipcode = rownames, zipcode_data) %>%
type_convert(trim_ws = T, na = c(""))
生成的df如下:
> df_final
# A tibble: 15 x 5
zipcode `Confirmed Cases ` `% of Total Cases ` `Deaths ` `% of Total Deaths `
<chr> <dbl> <chr> <dbl> <chr>
1 63301 1549 17.53% 40 28.99%
2 63366 1364 15.44% 38 27.54%
3 63303 1160 13.13% 21 15.22%
4 63385 1091 12.35% 12 8.70%
5 63304 1046 11.84% 3 2.17%
6 63368 896 10.14% 12 8.70%
7 63367 882 9.98% 9 6.52%
8 534 6.04% 1 0.72%
9 63348 105 1.19% 0 0.00%
10 63341 84 0.95% 1 0.72%
11 63332 64 0.72% 0 0.00%
12 63373 25 0.28% 1 0.72%
13 63386 17 0.19% 0 0.00%
14 63357 13 0.15% 0 0.00%
15 63376 5 0.06% 0 0.00%