I'm trying to get the link to every article in this category on the SF Chronicle, but I'm not sure where I should be extracting the URLs from. Here is my progress so far:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.sfchronicle.com/local/'
# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# html parsing
page_soup = soup(page_html, "html.parser")
zone2_container = page_soup.findAll("div",{"class":"zone zone-2"})
zone3_container = page_soup.findAll("div",{"class":"zone zone-3"})
zone4_container = page_soup.findAll("div",{"class":"zone zone-4"})
right_rail_container = page_soup.findAll("div",{"class":"right-rail"})
All the links I want live inside zone2_container through zone4_container and right_rail_container.
Answer 0 (score: 0)
You can get all the links with the following code:
all_zones = [zone2_container, zone3_container, zone4_container, right_rail_container]
urls = []
for i in all_zones:
    links = i[0].findAll('a')
    for link in links:
        urls.append(link['href'])
Here I have merged all the lists into a single list, but you could also define a function that does the same thing:
def get_urls(zone):
    urls = []
    for i in zone:
        links = i.findAll('a')
        for link in links:
            urls.append(link['href'])
    return urls
get_urls(zone2_container)
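For completeness, a minimal usage sketch (assuming the four containers from the question are already defined) that runs the helper over every zone and flattens the results into one list:
# assumes zone2_container ... right_rail_container exist as in the question
all_zones = [zone2_container, zone3_container, zone4_container, right_rail_container]
all_urls = []
for zone in all_zones:
    all_urls.extend(get_urls(zone))
print(len(all_urls))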
Answer 1 (score: 0)
It seems that you basically want all the article links. In that case you can use an attribute = value CSS selector with the contains operator (*=) to target href attributes whose value contains the substring 'article':
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin
base = 'https://www.sfchronicle.com/'
url = 'https://www.sfchronicle.com/local/'
res = requests.get(url)
soup = bs(res.content, 'lxml')
links = [urljoin(base,link['href']) for link in soup.select('[href*=article]')]
print(links)
print(len(links))
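If the same article is linked from several zones, links will contain duplicates. As an optional extra (not part of the original answer), they can be dropped while preserving first-seen order:
# optional: remove duplicate URLs while keeping their original order
unique_links = list(dict.fromkeys(links))
print(len(unique_links))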