I want to collect every hyperlink on a website whose URL contains one of these words:
product
service
solution
index
So I came up with this:
import requests
from bs4 import BeautifulSoup

site = 'https://www.similarweb.com'
resp = requests.get(site)
# Pass the declared charset to the parser only if the server sent one.
encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
soup = BeautifulSoup(resp.content, from_encoding=encoding)
contact_links = []
# Collect every href that contains one of the keywords anywhere in it.
for a in soup.find_all('a', href=True):
    if 'product' in a['href'] or 'service' in a['href'] or 'solution' in a['href'] or 'about' in a['href'] or 'index' in a['href']:
        contact_links.append(a['href'])
contact_links2 = []
# Turn relative links into absolute ones by prefixing the site URL.
for i in contact_links:
    string2 = i
    if string2[:4] == 'http':
        contact_links2.append(i)
    else:
        contact_links2.append(site+i)
for i in contact_links2:
    print(i)
Running this snippet against https://www.similarweb.com returns a number of links, including:
https://www.similarweb.com/apps/top/google/app-index/us/all/top-free
https://www.similarweb.com/corp/solution/travel/
https://www.similarweb.com/corp/about/
http://www.thedailybeast.com/articles/2016/10/17/drudge-limbaugh-fall-for-twitter-joke-about-postal-worker-destroying-trump-ballots.html
https://www.similarweb.com/apps/top/google/app-index/us/all/top-free
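From this result, I only want the links in which no further word follows product, service, solution, or index, so the keyword has to be the last part of the URL.
Expected output (considering only the first 5 links):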
https://www.similarweb.com/corp/about/
How can I do this?
Answer 0 (score: 1)
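In the if condition, put a forward slash before and after each word you check for. It should be if '/product/' in a['href'] ... and so on.
As mentioned in the comments, the word has to be the last one in the URL, so it is better to check a['href'].endswith('/product/').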
Because endswith accepts a tuple as its argument, you can write:
if a['href'].endswith(('/product/', '/index/', '/about/', '/solution/', '/service/')):
This condition evaluates to True for every URL that ends with any of the strings in the tuple.
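A minimal sketch of that check, run against the five sample links from the question so that no network request is needed. The names KEYWORD_SUFFIXES and ends_with_keyword are hypothetical and only used for illustration.

```python
# Hypothetical names for illustration; the sample URLs are the ones listed in the question.
KEYWORD_SUFFIXES = ('/product/', '/service/', '/solution/', '/about/', '/index/')


def ends_with_keyword(href):
    """Return True if href ends with one of the keyword path segments."""
    # str.endswith accepts a tuple and returns True if any of the suffixes matches.
    return href.endswith(KEYWORD_SUFFIXES)


sample_links = [
    'https://www.similarweb.com/apps/top/google/app-index/us/all/top-free',
    'https://www.similarweb.com/corp/solution/travel/',
    'https://www.similarweb.com/corp/about/',
    'http://www.thedailybeast.com/articles/2016/10/17/drudge-limbaugh-fall-for-twitter-joke-about-postal-worker-destroying-trump-ballots.html',
    'https://www.similarweb.com/apps/top/google/app-index/us/all/top-free',
]

matching = [link for link in sample_links if ends_with_keyword(link)]
print(matching)  # ['https://www.similarweb.com/corp/about/']
```

Note that a link written without the trailing slash, such as /corp/about, would not match. If such links should count too, one option is to normalize with href.rstrip('/') + '/' before the check.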
Answer 1 (score: 0)
import requests
from bs4 import BeautifulSoup
import re
from urllib.parse import urljoin

r = requests.get('https://www.similarweb.com/')
soup = BeautifulSoup(r.text, 'lxml')
urls = set()
# Keep only hrefs that end with one of the keywords followed by a final slash.
for i in soup.find_all('a', href=re.compile(r'((about)|(product)|(service)|(solution)|(index))/$')):
    url = i.get('href')
    # Resolve relative links against the URL of the fetched page.
    abs_url = urljoin(r.url, url)
    urls.add(abs_url)
# The set removes duplicate links.
print(urls)
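To see what the pattern in the code above keeps without fetching the page, here is a small offline sketch. It applies the same regular expression with re.search to a few hrefs. The first two come from the question's output; the relative href '/corp/about/' and the names base, sample_hrefs, and kept are assumptions made for illustration.

```python
import re
from urllib.parse import urljoin

# Same pattern as in the answer: a keyword followed by the final slash of the href.
pattern = re.compile(r'((about)|(product)|(service)|(solution)|(index))/$')

# Assumed base URL and sample hrefs; '/corp/about/' is a made-up relative form.
base = 'https://www.similarweb.com/'
sample_hrefs = [
    'https://www.similarweb.com/apps/top/google/app-index/us/all/top-free',
    'https://www.similarweb.com/corp/solution/travel/',
    '/corp/about/',
]

# Keep the hrefs that match, resolved to absolute URLs; the set drops duplicates.
kept = {urljoin(base, href) for href in sample_hrefs if pattern.search(href)}
print(kept)  # {'https://www.similarweb.com/corp/about/'}
```

Because the pattern does not require a slash before the keyword, an href such as /selfservice/ would also match. Prefixing the group with / is one way to tighten it if that matters.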