Can I use Selenium to web-scrape a published Power BI report?

Time: 2020-05-26 22:30:48

Tags: python selenium powerbi

I am trying to scrape the refresh-date object in a published Power BI report so that I can see when the report on the server was last refreshed. To do this, I have been trying to use Beautiful Soup and Selenium with Python. However, I have had no luck getting it to output any HTML at all.

Do I need to use a specific URL, or add something to the URL, to get it to scrape this report's site?

Here is the code I have been using:

## install google chrome
! wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
! sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
! apt-get -y update
! apt-get install -y google-chrome-stable
#
## install chromedriver
! apt-get install -yqq unzip
! echo "y" | wget -O /tmp/chromedriver.zip http://chromedriver.storage.googleapis.com/`curl -sS chromedriver.storage.googleapis.com/LATEST_RELEASE`/chromedriver_linux64.zip
! echo "y" | unzip /tmp/chromedriver.zip chromedriver -d /usr/local/bin/
#
## install xvfb
! apt-get install -yqq xvfb
#
## set display port and dbus env to avoid hanging  - in Environment file
%env DISPLAY=:99
%env DBUS_SESSION_BUS_ADDRESS=/dev/null
#
## install selenium
! pip install selenium==3.8.1
#
#
##Install beautifulsoup
! pip install bs4
#Functional Tools
import pandas as pd
import os
from pathlib import Path
import re
import pymssql
import numpy as np
import time 
from dash.dependencies import Input, Output
import csv
#from openpyxl import load_workbook
#API Packages
import urllib.request as urllib2
#web parsers
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from bs4 import BeautifulSoup
#URL tools
import requests
from pandas.io.json import json_normalize
from urllib.parse import urlencode
from requests.auth import HTTPBasicAuth
#print all outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
#Web Scraping Portion
# Login credentials
USER = (os.environ['USERNAME'])
PASS = (os.environ['PASSWORD'])
print(USER)
URL_start='http://'
URL_middle='@'
x='powerbi/reports/powerbi/path'
URL_end='/ReportName'
# Create a desired capabilities object as a starting point
capabilities = DesiredCapabilities.CHROME.copy()
capabilities['acceptInsecureCerts'] = True
#create options for webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--ignore-certificate-errors')
driver = webdriver.Chrome(desired_capabilities=capabilities, chrome_options=chrome_options)
#pull all text from the URL page:
#driver.implicitly_wait(10)
print(URL_start+URL_middle+x+URL_end)
driver.get(URL_start+USER+':'+PASS+URL_middle+x+URL_end) 
soup = BeautifulSoup(driver.page_source, 'html.parser')
soup
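For what it's worth, once `driver.page_source` actually contains the rendered report (Power BI reports are drawn by JavaScript, so the source can be nearly empty until rendering finishes), the refresh date could be pulled out with Beautiful Soup. A minimal sketch, assuming a hypothetical `div.refresh-info` element and text format; the real selector and wording would have to be found by inspecting the page with the browser's developer tools:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet of rendered report HTML -- the actual element
# and class name must be taken from the real page via developer tools.
html = '<div class="refresh-info">Last refresh: 5/26/2020 10:15 PM</div>'

soup = BeautifulSoup(html, 'html.parser')
node = soup.find('div', class_='refresh-info')
match = re.search(r'Last refresh:\s*(.+)', node.get_text())
refresh_date = match.group(1) if match else None
print(refresh_date)  # 5/26/2020 10:15 PM
```

The same `find`/regex pattern would then run against `driver.page_source` instead of the sample string.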

0 Answers:

No answers