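I'm having trouble with web scraping in Python using the requests and lxml libraries.
I need to get the information highlighted in yellow from this website (http://www.b3.com.br/pt_br/market-data-e-indices/indices/indices-amplos/indice-ibovespa-ibovespa-composicao-da-carteira.htm). However, my code returns: []
Could someone please help me?
The code is below: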
from lxml import html
import requests
page = requests.get('http://www.b3.com.br/pt_br/market-data-e-indices/indices/indices-amplos/indice-ibovespa-ibovespa-composicao-da-carteira.htm')
tree = html.fromstring(page.content)
cod = tree.xpath('//*[@id="divContainerIframeB3"]/div/div[1]/form/div[2]/div/table/tbody/tr[1]/td[1]')
print('The code is : ', cod)
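Answer 0 (score: 1)
The data is loaded from an external source via JavaScript, so it is not present in the HTML that requests downloads, and the XPath matches nothing. You can use this script to load the JSON data: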
import json
import base64
import requests

api_url = "https://sistemaswebb3-listados.b3.com.br/indexProxy/indexCall/GetPortfolioDay/{encoded_string}"

page = 1
index = "IBOV"

# Request parameters; they are sent base64-encoded as the last part of the URL path.
s = {
    "language": "pt-br",
    "pageNumber": page,
    "pageSize": 20,
    "index": index,
    "segment": "1",
}
encoded_string = base64.b64encode(str(s).encode("utf-8")).decode("utf-8")

data = requests.get(
    api_url.format(encoded_string=encoded_string),
    verify=False,  # note: this disables TLS certificate verification
).json()

# uncomment this to get all data:
# print(json.dumps(data, indent=4))

# Print code, asset name and theoretical quantity in aligned columns.
for result in data["results"]:
    print(
        "{:<8} {:<15} {:15}".format(
            result["cod"], result["asset"], result["theoricalQty"]
        )
    )
Prints:
ABEV3    AMBEV S/A       4.355.174.839
ASAI3    ASSAI           157.635.935
AZUL4    AZUL            327.283.207
BTOW3    B2W DIGITAL     201.549.295
B3SA3    B3              1.930.877.944
BBSE3    BBSEGURIDADE    671.584.841
BRML3    BR MALLS PAR    843.728.684
BBDC3    BRADESCO        1.261.986.269
BBDC4    BRADESCO        4.687.814.597
BRAP4    BRADESPAR       222.075.664
BBAS3    BRASIL          1.283.197.221
BRKM5    BRASKEM         264.640.575
BRFS3    BRF SA          811.759.800
BPAC11   BTGP BANCO      263.871.572
CRFB3    CARREFOUR BR    391.758.726
CCRO3    CCR SA          1.115.695.556
CMIG4    CEMIG           969.723.092
HGTX3    CIA HERING      126.186.408
CIEL3    CIELO           1.112.196.638
COGN3    COGNA ON        1.847.994.874
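The output above covers only one page of 20 rows, and the quantities come back as strings. The sketch below may help with both. The helper names build_portfolio_url and parse_quantity are hypothetical, not part of any library. It assumes that pageNumber selects the page to fetch and that the dots in theoricalQty are thousands separators. The parameters are encoded the same way as in the script above.

```python
import base64

# Same endpoint as in the script above.
API_URL = (
    "https://sistemaswebb3-listados.b3.com.br/indexProxy/indexCall/"
    "GetPortfolioDay/{encoded_string}"
)


def build_portfolio_url(page, index="IBOV", page_size=20):
    """Return the request URL for one page (hypothetical helper)."""
    params = {
        "language": "pt-br",
        "pageNumber": page,  # assumption: this selects which page is returned
        "pageSize": page_size,
        "index": index,
        "segment": "1",
    }
    # Encode the dict exactly as the script above does: str() then base64.
    encoded = base64.b64encode(str(params).encode("utf-8")).decode("utf-8")
    return API_URL.format(encoded_string=encoded)


def parse_quantity(text):
    """Convert a string such as '4.355.174.839' to an int.

    Assumption: the dots are thousands separators, not decimal points.
    """
    return int(text.replace(".", ""))


if __name__ == "__main__":
    print(build_portfolio_url(2))
    print(parse_quantity("4.355.174.839"))
```

The URL for each page could then be passed to requests.get as in the script above, stopping once a page comes back with an empty "results" list. Whether the endpoint signals the last page that way is not confirmed here.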