I'm trying to extract the link under `<a href="link" ...>`.
Since there are multiple rows, I iterate over each one. The first link in each row is the one I need, so I use find_all('tr') and find('a'). I know find('a') can return a NoneType, but I don't know how to work around that.
I have a piece of code that works, but it's very inefficient (shown in the comments).
import urllib.request
import bs4 as bs

sauce = urllib.request.urlopen('https://morocco.observation.org/soortenlijst_wg_v3.php')
soup = bs.BeautifulSoup(sauce, 'lxml')
tabel = soup.find('table', {'class': 'tablesorter'})

for i in tabel.find_all('tr'):
    # if 'view' in i.get('href'):
    #     link_list.append(i.get('href'))
    link = i.find('a')
    # e.g. <a class="z1" href="/soort/view/164?from=1987-12-05&to=2019-05-31">Common Reed Bunting - <em>Emberiza schoeniclus</em></a>
How do I retrieve the link under href, starting from /soort/view/164?from=1987-12-05&to=2019-05-31, and work around the NoneType problem?
Thanks in advance.
Answer 0 (score: 0)
link = i.find('a')
_href = link['href']
print(_href)
Output:
"/soort/view/164?from=1987-12-05&to=2019-05-31?"
That is not a complete URL; you should join it with the domain name:
new_url = "https://morocco.observation.org"+_href
print(new_url)
Output:
https://morocco.observation.org/soort/view/164?from=1987-12-05&to=2019-05-31?
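As a side note, a more robust alternative to string concatenation (a sketch, not part of the original answer) is `urllib.parse.urljoin` from the standard library, which handles leading and trailing slashes for you:

```python
from urllib.parse import urljoin

base_url = "https://morocco.observation.org"
_href = "/soort/view/164?from=1987-12-05&to=2019-05-31"

# urljoin resolves the relative href against the base domain
new_url = urljoin(base_url, _href)
print(new_url)
# https://morocco.observation.org/soort/view/164?from=1987-12-05&to=2019-05-31
```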
Update:
from bs4 import BeautifulSoup
from bs4.element import Tag
import requests

resp = requests.get("https://morocco.observation.org/soortenlijst_wg_v3.php")
soup = BeautifulSoup(resp.text, 'lxml')
tabel = soup.find('table', {'class': 'tablesorter'})
base_url = "https://morocco.observation.org"

for i in tabel.find_all('tr'):
    link = i.find('a', href=True)
    if link is None or not isinstance(link, Tag):
        continue
    url = base_url + link['href']
    print(url)
Output:
https://morocco.observation.org/soort/view/248?from=1975-05-05&to=2019-06-01
https://morocco.observation.org/soort/view/174?from=1989-12-15&to=2019-06-01
https://morocco.observation.org/soort/view/57?from=1975-05-05&to=2019-06-01
https://morocco.observation.org/soort/view/19278?from=1975-05-13&to=2019-06-01
https://morocco.observation.org/soort/view/56?from=1993-03-25&to=2019-06-01
https://morocco.observation.org/soort/view/1504?from=1979-05-25&to=2019-06-01
https://morocco.observation.org/soort/view/78394?from=1975-05-09&to=2019-06-01
https://morocco.observation.org/soort/view/164?from=1987-12-05&to=2019-06-01
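The key to the NoneType fix above is the `None` check: rows without an anchor (e.g. a header row) make `find('a', href=True)` return `None`, and indexing `None['href']` would raise a TypeError. A minimal offline illustration with fabricated HTML (the real page's markup may differ):

```python
from bs4 import BeautifulSoup

# Fabricated snippet: the first row is a header with no <a> tag
html = ("<table><tr><th>Species</th></tr>"
        "<tr><td><a href='/soort/view/164'>x</a></td></tr></table>")
soup = BeautifulSoup(html, 'html.parser')

links = []
for row in soup.find_all('tr'):
    link = row.find('a', href=True)
    if link is None:  # skip rows with no anchor instead of crashing
        continue
    links.append(link['href'])
print(links)
# ['/soort/view/164']
```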
Answer 1 (score: 0)
A logical approach is to isolate the target column using nth-of-type:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://morocco.observation.org/soortenlijst_wg_v3.php')
soup = bs(r.content, 'lxml')
base = 'https://morocco.observation.org'
urls = [base + item['href'] for item in soup.select('#mytable_S td:nth-of-type(3) a')]
You can also pass a list of classes:
urls = [base + item['href'] for item in soup.select('.z1, .z2, .z3, .z4')]
Or even use the ^ (starts-with) operator on class:
urls = [base + item['href'] for item in soup.select('[class^=z]')]
Or the * (contains) operator on href:
urls = [base + item['href'] for item in soup.select('[href*=view]')]
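The last two selector variants can be tried offline against a small fabricated snippet (a sketch; the real page's markup may differ), showing that both match the same anchors here:

```python
from bs4 import BeautifulSoup

# Fabricated table rows mimicking the page's z-class anchors
html = """
<table id="mytable_S">
  <tr><td>1</td><td>x</td><td><a class="z1" href="/soort/view/164?from=1987-12-05&amp;to=2019-05-31">A</a></td></tr>
  <tr><td>2</td><td>y</td><td><a class="z2" href="/soort/view/57?from=1975-05-05&amp;to=2019-06-01">B</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
base = 'https://morocco.observation.org'

# class attribute starting with "z" vs. href containing "view"
by_class = [base + a['href'] for a in soup.select('[class^=z]')]
by_href = [base + a['href'] for a in soup.select('[href*=view]')]

assert by_class == by_href  # both selectors target the same anchors here
print(by_class)
```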
Read about the different CSS selector methods here: https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors