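I am trying to get all the hrefs from a URL. The problem is that I cannot extract the full href, which appears in the page like this: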
<a href="#!DetalleNorma/203906/20190322" title="" data-bind="html: organismo, attr: {href: $root.crearHrefDetalleNorma(idTamite,fechaPublicacion)} ">SECRETARÍA GENERAL</a>
All I can extract is: #!
from bs4 import BeautifulSoup
import urllib.request as urllib2
import re

html_page = urllib2.urlopen('https://www.boletinoficial.gob.ar/')
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    # Python 3: print is a function
    print(link.get('href'))
Here is another parsing attempt. It does not work either:
import requests
from lxml import html
from bs4 import BeautifulSoup

r = requests.get('https://www.boletinoficial.gob.ar/')
soup = BeautifulSoup(r.content, "html.parser")
for td in soup.findAll("div", class_="itemsection"):
    for a in td.findAll("a", href=True):
        print(a.text)
Answer 0 (score: 1)
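I had to use Selenium with a wait condition. The data-bind attribute on the anchor suggests that the href is filled in by JavaScript after the page loads, so urllib and requests only see the placeholder #! in the raw HTML.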
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get('https://www.boletinoficial.gob.ar/')
links = [item.get_attribute('href') for item in WebDriverWait(driver,20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".itemsection [href]")))]
print(links)
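If the collected hrefs need further processing, the following is a minimal sketch of one way to split them into a route name and its arguments. It assumes the hrefs follow the `#!DetalleNorma/203906/20190322` pattern shown in the question; `split_hashbang` and `sample_links` are hypothetical names introduced here for illustration, and the sample values are modeled on the question rather than taken from a real run.

```python
from urllib.parse import urlsplit


def split_hashbang(url):
    """Return (route, args) for a '#!Route/arg1/arg2' URL, or None for a bare '#!'."""
    # urlsplit puts everything after '#' into .fragment,
    # e.g. '!DetalleNorma/203906/20190322'
    fragment = urlsplit(url).fragment
    if not fragment.startswith("!") or fragment == "!":
        return None  # no hash-bang route, or only the unfilled placeholder
    route, *args = fragment[1:].split("/")
    return route, args


# Hypothetical sample values modeled on the href in the question
sample_links = [
    "https://www.boletinoficial.gob.ar/#!DetalleNorma/203906/20190322",
    "https://www.boletinoficial.gob.ar/#!",
]

parsed = [split_hashbang(u) for u in sample_links]
print(parsed)
```

Entries that come back as `None` are links whose href was still the placeholder, which can be a useful signal that the wait condition fired before the bindings were applied.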
To get the links and text as tuples:
data = [(item.get_attribute('href'), item.text) for item in WebDriverWait(driver,20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".itemsection [href]")))]
print(data)
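For context on why the static parsers in the question return only `#!`, here is a small sketch using only the standard library. It assumes the HTML delivered by the server contains the anchor with a placeholder href and an unevaluated `data-bind` attribute; the `static_html` string is an assumption reconstructed from the question, not a capture of the real response, and `HrefCollector` is a hypothetical helper.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag, as a static parser sees it."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attr_map = dict(attrs)
            if "href" in attr_map:
                self.hrefs.append(attr_map["href"])


# ASSUMED shape of the markup before any JavaScript has run
static_html = (
    '<a href="#!" title="" data-bind="html: organismo, '
    'attr: {href: $root.crearHrefDetalleNorma(idTamite,fechaPublicacion)}"></a>'
)

collector = HrefCollector()
collector.feed(static_html)
print(collector.hrefs)  # ['#!']
```

A static parser never evaluates the binding expression, so the real href only exists in a browser's DOM, which is what Selenium reads.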