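I can't get BeautifulSoup to parse the full markup of this site: https://www.bcb.gov.br/
The values I want are between <app-root> ... </app-root>,
but when I use the code below, the content inside the app-root
tag is not parsed: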
import urllib.request as urllib2
from bs4 import BeautifulSoup as bs
html = 'https://www.bcb.gov.br'
page = urllib2.urlopen(html)
soup = bs(page, 'html.parser')
print(soup)
The result is:
<!DOCTYPE doctype html>
<html lang="en"><head><meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<script>document.head.innerHTML += "<base href='" + window.location.protocol
+ "//" + window.location.host +"/" + "'>"</script><meta charset="utf-8"/>
<title>Banco Central do Brasil</title><meta content="width=device-
width,initial-scale=1" name="viewport"/><link href="favicon.ico" rel="icon"
type="image/x-icon"/><link href="https://fonts.googleapis.com/css?
family=Cormorant+Garamond:300,300i,400,400i,500,500i,600,600i,700,700i|
Ubuntu:300,300i,400,400i,500,500i,700,700i" rel="stylesheet"/><script
src="assets/js/config.js"></script><link
href="styles.ad070d90de458f2251ec.bundle.css" rel="stylesheet"/></head>
<body><app-root></app-root><!-- Global site tag (gtag.js) - Google Analytics
--><script async="" src="https://www.googletagmanager.com/gtag/js?id=UA-
65460906-3"></script><script>window.dataLayer = window.dataLayer || [];
function gtag() { dataLayer.push(arguments); }
gtag('js', new Date());
gtag('config', 'UA-65460906-3');</script><script
src="inline.b9c96f03aa7f6b76c42d.bundle.js?v=5" type="text/javascript">
</script><script src="polyfills.a7b9da535b3a5a6fbe04.bundle.js?v=5"
type="text/javascript"></script><script
src="scripts.b27f0359c1c3f740a0de.bundle.js?v=5" type="text/javascript">
</script><script src="vendor.3d7ec463120170ac4b21.bundle.js?v=5"
type="text/javascript"></script><script
src="main.36b8710c7447c7df695a.bundle.js?v=5" type="text/javascript">
</script></body></html>
You can see that the <app-root></app-root> tag
sits right before the "Global site tag" comment
and has nothing inside it. That is why I can't scrape the values I want.
Can anyone help me?
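Answer 0 (score: 0)
Because the server serves it empty. Just look at the raw text of the page.
# page is the response from urllib2.urlopen(html); it can only be read once,
# so iterate over it before handing it to BeautifulSoup
for line in page:
    print(line)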
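To make that check less of an eyeball exercise, here is a minimal sketch that reports whether a given element has any content in a chunk of HTML. It uses only the standard library, and it runs on two hard-coded strings so it works offline. The class name, the function name, and both sample strings are made up for illustration; the first sample is trimmed down from the output in the question, and the second is a hypothetical rendered version.

```python
from html.parser import HTMLParser


class TagContentProbe(HTMLParser):
    """Record child tags and text that appear inside one named element."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name
        self.inside = False      # True while we are between <tag> and </tag>
        self.child_tags = 0      # number of start tags seen inside the element
        self.text_parts = []     # text nodes seen inside the element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag_name:
            self.inside = True
        elif self.inside:
            self.child_tags += 1

    def handle_endtag(self, tag):
        if tag == self.tag_name:
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text_parts.append(data)


def has_content(html_text, tag_name="app-root"):
    """Return True if the element contains any child tag or non-blank text."""
    probe = TagContentProbe(tag_name)
    probe.feed(html_text)
    probe.close()
    text = "".join(probe.text_parts).strip()
    return probe.child_tags > 0 or bool(text)


# Trimmed from the output shown in the question: the element is empty.
served = '<body><app-root></app-root><script src="main.js"></script></body>'
# Hypothetical markup after the page's JavaScript has filled the element.
rendered = '<body><app-root><div class="rate">hypothetical value</div></app-root></body>'

print(has_content(served))    # False
print(has_content(rendered))  # True
```

If has_content() returns False for the HTML you downloaded, the parser is not dropping anything: the content simply is not in the response.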
Answer 1 (score: 0)
You have to let the page render before you scrape the HTML.
You can do that with Selenium or Requests-HTML.
Here is an example with Selenium:
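from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup as bs
url = 'https://www.bcb.gov.br'
# Selenium 4 takes the chromedriver path through a Service object;
# older releases accepted it as the first positional argument.
driver = webdriver.Chrome(service=Service("C:/chromedriver_win32/chromedriver.exe"))
driver.get(url)  # loads the page in a real browser, so its JavaScript runs
soup = bs(driver.page_source, 'html.parser')
driver.quit()  # close the browser once the HTML has been captured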
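One caveat: a page that fills itself in with JavaScript may still have an empty app-root for a moment after driver.get() returns, so reading driver.page_source right away can give you the same empty tag. Selenium ships its own WebDriverWait for this. The sketch below is only a dependency-free stand-in that shows the polling idea and runs without a browser; wait_for, the readiness check, and the fake snapshots are all hypothetical names and data, not part of Selenium.

```python
import time


def wait_for(get_value, is_ready, timeout=10.0, interval=0.5):
    """Call get_value() until is_ready(value) is true, or raise on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        value = get_value()
        if is_ready(value):
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for a browser: the element is empty on the first two reads,
# then filled on the third (hypothetical content).
snapshots = iter([
    "<app-root></app-root>",
    "<app-root></app-root>",
    "<app-root><div>hypothetical content</div></app-root>",
])

html = wait_for(
    get_value=lambda: next(snapshots),
    is_ready=lambda source: "<app-root></app-root>" not in source,
    timeout=5.0,
    interval=0.01,
)
print(html)
```

With a real driver you would pass lambda: driver.page_source as get_value and hand the returned string to BeautifulSoup. The readiness check here is deliberately naive and is an assumption; for a real page you would test for an element you actually expect to appear.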