I'm using BeautifulSoup with Python 3.5, and I'm trying to scrape a website for all of its h-tags (every h1, h2, and so on). My problem is getting the program to follow the other links on the site and scrape their h-tags as well.
So, say I have a website with a navigation menu whose links lead all over the site, and every page contains h-tags of some kind. How do I scrape all of them across the site I've chosen?
This is the code I'm currently using; it only scrapes the h1 tags from one specific URL:
import requests
from bs4 import BeautifulSoup
url = "http://dsv.su.se/en/research"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
h1_data = soup.find_all("h1")
for item in h1_data:
    print(item.contents[0])
I hope that's clear enough. Thanks.
Answer 0 (score: 0)
Using your example URL, we can grab all the URLs from the HeadRowMenu and pull every h-tag from each page in a loop.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from itertools import chain  # used below when printing the results

url = "http://dsv.su.se/en"
base = "http://dsv.su.se"

def crawl(start, base):
    r = requests.get(start)
    soup = BeautifulSoup(r.content, "lxml")
    hs = ["h1", "h2", "h3", "h4", "h5", "h6"]
    # Collect the menu links, skipping the first entry (the start page itself).
    menu_links = [urljoin(base, a["href"]) for a in soup.select("#HeadRowMenu a")][1:]
    # Headings from the start page.
    for h in hs:
        yield soup.find_all(h)
    # Headings from every page linked in the menu.
    for lnk in menu_links:
        soup = BeautifulSoup(requests.get(lnk).content, "lxml")
        for h in hs:
            yield soup.find_all(h)
If we run it:
In [17]: print(list(chain.from_iterable(crawl(url, base))))
[<h1 class="visuallyhidden">Department of Computer and Systems Sciences</h1>, <h1>
<a href="/en/about/news/improve-your-digital-competences-on-line-with-eskills-match-1.278510">Improve your digital competences on-line with eSkills Match</a>
</h1>, <h1>
<a href="/en/about/news/envisioning-equality-in-computer-science-tomorrow-today-1.272045">Envisioning Equality in Computer Science - Tomorrow Today</a>
</h1>, <h1>
<a href="/en/about/news/egovlab-develops-online-democracy-1.271839">eGovlab develops online democracy</a>
</h1>, <h1>
<a href="/en/about/events/vinnova-and-dsv-invite-you-to-a-seminar-about-horizon-2020-1.266104">Vinnova and DSV invite you to a seminar about Horizon 2020</a>
</h1>, <h1>
<a href="/en/about/news/significant-increase-of-applicants-for-international-programmes-1.265744">Significant increase of applicants for international programmes</a>
</h1>, <h1>News</h1>, <h2>Semester start information</h2>, <h2>Meet our students</h2>, <h1 class="visuallyhidden">Education</h1>, <h1>Welcome to the education web at DSV!</h1>, <h1>Master's Programmes at DSV</h1>, <h2>
Master's Programmes in English:</h2>, <h1 class="visuallyhidden">Research</h1>, <h1>Research highlights</h1>, <h2>Research news</h2>, <h1 class="visuallyhidden">About us</h1>, <h1>About DSV</h1>, <h2>Sweden's oldest IT department</h2>, <h2>Interdisciplinary education and research</h2>, <h2>Right in the middle of one of the world's leading ICT clusters</h2>, <h1 class="visuallyhidden">Internal</h1>, <h1>Internal</h1>, <h2>Semester start information</h2>, <h2>Meet our students</h2>]
If you want to crawl every link on the site, you should look at scrapy. It is not a trivial task, because you cannot blindly visit every link you find: that can take you anywhere and loop forever. You need to make sure you only visit the domain you care about, which is easy to do with scrapy; have a look at CrawlSpider.
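As a rough, untested sketch of that scrapy route (this is not the answer's own code; the spider name, output fields and start URL are placeholders), a CrawlSpider restricted to the target domain could look something like this:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class HeadingSpider(CrawlSpider):
    # Hypothetical spider; name and output fields are placeholders.
    name = "headings"
    allowed_domains = ["dsv.su.se"]        # keeps the crawl inside one domain
    start_urls = ["http://dsv.su.se/en"]

    # Follow every in-domain link and hand each response to parse_page.
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        # Yield the text of every h1..h6 on the page.
        for level in range(1, 7):
            tag = "h%d" % level
            for text in response.css("%s ::text" % tag).extract():
                if text.strip():
                    yield {"url": response.url, "tag": tag, "text": text.strip()}
It could then be run with something like scrapy runspider heading_spider.py -o headings.json, with allowed_domains doing the work of keeping the crawl from wandering off-site.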
Rolling your own:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

class Crawl:
    def __init__(self, start_url, allowed, base, select):
        self.start_url = start_url
        self.base = base
        self.allowed_domain = allowed
        self.crawled = set()
        self.select = select

    def start(self):
        r = requests.get(self.start_url)
        soup = BeautifulSoup(r.content, "lxml")
        menu_links = [urljoin(self.base, a["href"]) for a in soup.select(self.select)]
        for lnk in menu_links:
            yield from self.crawl(lnk)

    def crawl(self, lnk):
        r = requests.get(lnk)
        soup = BeautifulSoup(r.content, "lxml")
        hs = ["h1", "h2", "h3", "h4", "h5", "h6"]
        page_links = (a["href"] for a in soup.select("a[href]"))
        # Join relative /en/ links onto the base URL, leave absolute links as they are.
        joined = (urljoin(self.base, href) if href.startswith("/en/") else href for href in page_links)
        for link in filter(lambda l: l.startswith("http"), joined):
            if link not in self.crawled:
                soup = BeautifulSoup(requests.get(link).content, "lxml")
                for h in hs:
                    yield soup.find_all(h)
                self.crawled.add(link)
A sample run:
In [2]: from itertools import chain
In [3]: url = "http://dsv.su.se/en"
In [4]: base = "http://dsv.su.se"
In [5]: crawler = Crawl(url, "dsv.su.se", base, "#HeadRowMenu a")
In [6]: for h in chain.from_iterable(crawler.start()):
   ...:     print(h)
   ...:
<h1 class="visuallyhidden">Institutionen för data- och systemvetenskap</h1>
<h1>
<a href="/omdsv/evenemang/dsv-50-%C3%A5r-digitala-aff%C3%A4rer-%C3%B6ppet-jubileumsseminarium-1.274298">*DSV 50 år* - Digitala affärer - öppet jubileumsseminarium </a>
</h1>
<h1>
<a href="/omdsv/nyheter/premi%C3%A4r-f%C3%B6r-vandringsdramat-exil-fria-poeter-p%C3%A5-flykt-1.278502">Premiär för vandringsdramat Exil - fria poeter på flykt</a>
</h1>
<h1>
<a href="/omdsv/nyheter/nu-b%C3%B6r-det-st%C3%A5-klart-att-n%C3%A5got-m%C3%A5ste-g%C3%B6ras-1.277680">Nu bör det stå klart att något måste göras </a>
</h1>
<h1>
<a href="/omdsv/nyheter/hur-enkelt-%C3%A4r-det-f%C3%B6r-fbi-att-kn%C3%A4cka-en-iphone-utan-apples-hj%C3%A4lp-1.277546">Hur enkelt är det för FBI att knäcka en Iphone utan Apples hjälp?</a>
</h1>
<h1>
<a href="/omdsv/nyheter/1-av-2-vill-l%C3%A5ta-staten-hacka-sig-in-i-datorer-1.277367">Svårt att backa tillbaka från ökad övervakning</a>
</h1>
<h1>Senaste nyheterna</h1>
<h2 class="category">Kommande evenemang</h2>
<h2>Information inför terminsstart</h2>
<h1 class="visuallyhidden">Other languages</h1>
<h1>Other languages</h1>
<h2>
Information in Chinese and Russian</h2>
<h2>Contact The Administration of Studies</h2>
<h1 class="visuallyhidden">Department of Computer and Systems Sciences</h1>
<h1>
<a href="/en/about/news/improve-your-digital-competences-on-line-with-eskills-match-1.278510">Improve your digital competences on-line with eSkills Match</a>
</h1>
<h1>
<a href="/en/about/news/envisioning-equality-in-computer-science-tomorrow-today-1.272045">Envisioning Equality in Computer Science - Tomorrow Today</a>
</h1>
<h1>
<a href="/en/about/news/egovlab-develops-online-democracy-1.271839">eGovlab develops online democracy</a>
</h1>
<h1>
<a href="/en/about/events/vinnova-and-dsv-invite-you-to-a-seminar-about-horizon-2020-1.266104">Vinnova and DSV invite you to a seminar about Horizon 2020</a>
</h1>
<h1>
<a href="/en/about/news/significant-increase-of-applicants-for-international-programmes-1.265744">Significant increase of applicants for international programmes</a>
</h1>
<h1>News</h1>
<h2>Semester start information</h2>
<h2>Meet our students</h2>
...................................
Obviously, if you want to go deeper you need to add more logic: store all the discovered links in one structure and loop until it is empty. Something like this:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from time import sleep

class Crawl:
    def __init__(self, start_url, allowed, base, select):
        self.start_url = start_url
        self.base = base
        self.allowed_domain = allowed
        self.crawled = set()
        self.select = select
        self.urls = set()

    def start(self):
        r = requests.get(self.start_url)
        soup = BeautifulSoup(r.content, "lxml")
        menu_links = [urljoin(self.base, a["href"]) for a in soup.select(self.select)]
        print(menu_links)
        for lnk in menu_links:
            yield from self.crawl(lnk)

    def filter_urls(self, soup):
        page_links = [a["href"] for a in soup.select("a[href]")]
        # Join relative /en/ links onto the base URL and keep only in-domain links.
        joined = (urljoin(self.base, lnk) if lnk.startswith("/en/") else lnk for lnk in page_links)
        return set(filter(lambda lnk: self.allowed_domain in lnk, joined))

    def crawl(self, lnk):
        r = requests.get(lnk)
        soup = BeautifulSoup(r.content, "lxml")
        hs = ["h1", "h2", "h3", "h4", "h5", "h6"]
        self.urls.update(self.filter_urls(soup))
        while self.urls:
            nxt = self.urls.pop()
            if nxt not in self.crawled:
                try:
                    soup = BeautifulSoup(requests.get(nxt).content, "lxml")
                except requests.exceptions.RequestException as e:
                    print(e)
                    self.crawled.add(nxt)
                    continue
                self.urls.update(self.filter_urls(soup) - self.crawled)
                for h in hs:
                    yield soup.find_all(h)
                self.crawled.add(nxt)
                sleep(.1)
This will visit every link whose URL contains dsv.su.se, but be warned that there are a lot of links to scrape, so be prepared to wait a while.
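For completeness, a usage sketch for this deeper version (untested, mirroring the earlier sample run; the constructor arguments are the same ones used above):
url = "http://dsv.su.se/en"
base = "http://dsv.su.se"

crawler = Crawl(url, "dsv.su.se", base, "#HeadRowMenu a")
# start() yields one list of tags per heading level per page.
for tags in crawler.start():
    for h in tags:
        print(h.get_text(strip=True))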
Answer 1 (score: -1)
Here is a demo version (untested) that does what you describe. Basically, you add the URLs you discover to a queue and keep going until every link has been crawled.
import requests
from bs4 import BeautifulSoup
from collections import deque
from urllib.parse import urljoin

seen = set()
queue = deque(['http://dsv.su.se/en/research'])

while len(queue):
    url = queue.popleft()
    if url not in seen:
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "html.parser")
        # Print every h1 on this page.
        for item in soup.find_all("h1"):
            print(item.get_text(strip=True))
        # Queue the links discovered on this page.
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link not in seen:
                queue.append(link)
        seen.add(url)
I think you need something like this.
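One thing worth adding to this queue-based sketch, echoing the point made in the first answer, is a domain check before enqueueing, so the crawl cannot wander off-site. A minimal, hypothetical guard might look like this:
from urllib.parse import urlparse

ALLOWED_DOMAIN = "dsv.su.se"  # assumption: only crawl this host

def in_domain(link):
    # True only for links whose host ends with the allowed domain.
    return urlparse(link).netloc.endswith(ALLOWED_DOMAIN)
You would then only call queue.append(link) when in_domain(link) is true.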