Web scraping HTML heading tags, including the tags on all linked pages

Asked: 2016-04-19 09:58:33

Tags: python regex hyperlink beautifulsoup

Using BeautifulSoup and Python 3.5, I am trying to scrape a website for all of its h-tags (h1, h2, and so on). My problem is getting the program to open the other links on the site and scrape their tags as well.

So let's say I have a site with a navigation menu whose links span the whole site, and every page contains some kind of h-tag. How do I scrape all of them from a site of my choosing?

This is the code I am currently using, which only scrapes the h1 tags from a single URL:

import requests
from bs4 import BeautifulSoup

url = "http://dsv.su.se/en/research"

r = requests.get(url)

soup = BeautifulSoup(r.content, "html.parser")

h1_data = soup.find_all("h1")

for item in h1_data:
    print (item.contents[0])

I hope I have been clear enough. Thanks.

2 Answers:

Answer 0 (score: 0)

Using your example URL, we can get all the URLs from the HeadRowMenu and loop over them, extracting all the h-tags from each page:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "http://dsv.su.se/en"

base = "http://dsv.su.se"


def crawl(start, base):
    r = requests.get(start)
    soup = BeautifulSoup(r.content, "lxml")
    hs = ["h1", "h2", "h3", "h4", "h5", "h6"]
    # Skip the first menu entry, which points back to the start page itself.
    menu_links = [urljoin(base, a["href"]) for a in soup.select("#HeadRowMenu a")][1:]
    # Headings from the start page.
    for h in hs:
        yield soup.find_all(h)
    # Headings from every page linked in the menu.
    for lnk in menu_links:
        soup = BeautifulSoup(requests.get(lnk).content, "lxml")
        for h in hs:
            yield soup.find_all(h)

If we run it (after importing chain from itertools):

In [17]: print(list(chain.from_iterable(crawl(url, base))))

[<h1 class="visuallyhidden">Department of Computer and Systems Sciences</h1>, <h1>
<a href="/en/about/news/improve-your-digital-competences-on-line-with-eskills-match-1.278510">Improve your digital competences on-line with eSkills Match</a>
</h1>, <h1>
<a href="/en/about/news/envisioning-equality-in-computer-science-tomorrow-today-1.272045">Envisioning Equality in Computer Science - Tomorrow Today</a>
</h1>, <h1>
<a href="/en/about/news/egovlab-develops-online-democracy-1.271839">eGovlab develops online democracy</a>
</h1>, <h1>
<a href="/en/about/events/vinnova-and-dsv-invite-you-to-a-seminar-about-horizon-2020-1.266104">Vinnova and DSV invite you to a seminar about Horizon 2020</a>
</h1>, <h1>
<a href="/en/about/news/significant-increase-of-applicants-for-international-programmes-1.265744">Significant increase of applicants for international programmes</a>
</h1>, <h1>News</h1>, <h2>Semester start information</h2>, <h2>Meet our students</h2>, <h1 class="visuallyhidden">Education</h1>, <h1>Welcome to the education web at DSV!</h1>, <h1>Master's Programmes at DSV</h1>, <h2>
    Master's Programmes in English:</h2>, <h1 class="visuallyhidden">Research</h1>, <h1>Research highlights</h1>, <h2>Research news</h2>, <h1 class="visuallyhidden">About us</h1>, <h1>About DSV</h1>, <h2>Sweden's oldest IT department</h2>, <h2>Interdisciplinary education and research</h2>, <h2>Right in the middle of one of the world's leading ICT clusters</h2>, <h1 class="visuallyhidden">Internal</h1>, <h1>Internal</h1>, <h2>Semester start information</h2>, <h2>Meet our students</h2>]

If you want to crawl every link on the site, you should look at scrapy. It is not a trivial task, because you cannot blindly visit every link you find: links can take you anywhere and loop forever. You need to make sure you only visit the domain you are interested in, which scrapy makes easy. Have a look at CrawlSpider.
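A minimal, untested CrawlSpider sketch of that idea might look like the following; the spider name and the dict it yields are just placeholders, and it assumes a standard scrapy install:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class HeadingSpider(CrawlSpider):
    name = "headings"                    # placeholder spider name
    allowed_domains = ["dsv.su.se"]      # offsite middleware keeps the crawl on this domain
    start_urls = ["http://dsv.su.se/en"]

    # Follow every extracted link and run parse_page on each response.
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        # //hN//text() grabs the text of each heading, including text
        # nested inside child tags such as links.
        for level in range(1, 7):
            for text in response.xpath("//h{}//text()".format(level)).extract():
                if text.strip():
                    yield {"url": response.url, "heading": text.strip()}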

To roll your own:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


class Crawl:
    def __init__(self, start_url, allowed, base, select):
        self.start_url = start_url
        self.base = base
        self.allowed_domain = allowed
        self.crawled = set()
        self.select = select

    def start(self):
        # Parse the start page and follow every link matched by the CSS selector.
        r = requests.get(self.start_url)
        soup = BeautifulSoup(r.content, "lxml")
        menu_links = [urljoin(self.base, a["href"]) for a in soup.select(self.select)]
        for lnk in menu_links:
            yield from self.crawl(lnk)

    def crawl(self, lnk):
        r = requests.get(lnk)
        soup = BeautifulSoup(r.content, "lxml")
        hs = ["h1", "h2", "h3", "h4", "h5", "h6"]
        # Resolve relative links against the base URL and keep only absolute http(s) URLs.
        page_links = (a["href"] for a in soup.select("a[href]"))
        joined = (urljoin(self.base, lnk) if lnk.startswith("/en/") else lnk for lnk in page_links)
        for lnk in filter(lambda link: link.startswith("http"), joined):
            if lnk not in self.crawled:
                soup = BeautifulSoup(requests.get(lnk).content, "lxml")
                for h in hs:
                    yield soup.find_all(h)
            self.crawled.add(lnk)

A sample run:

In [2]: from itertools import chain    
In [3]: url = "http://dsv.su.se/en"    
In [4]: base = "http://dsv.su.se"    
In [5]: crawler = Crawl(url, "dsv.su.se", base, "#HeadRowMenu a") 

In [6]: for h in chain.from_iterable(crawler.start()):
   ...:          print(h)
   ...:     
<h1 class="visuallyhidden">Institutionen för data- och systemvetenskap</h1>
<h1>
<a href="/omdsv/evenemang/dsv-50-%C3%A5r-digitala-aff%C3%A4rer-%C3%B6ppet-jubileumsseminarium-1.274298">*DSV 50 år* - Digitala affärer - öppet jubileumsseminarium </a>
</h1>
<h1>
<a href="/omdsv/nyheter/premi%C3%A4r-f%C3%B6r-vandringsdramat-exil-fria-poeter-p%C3%A5-flykt-1.278502">Premiär för vandringsdramat Exil - fria poeter på flykt</a>
</h1>
<h1>
<a href="/omdsv/nyheter/nu-b%C3%B6r-det-st%C3%A5-klart-att-n%C3%A5got-m%C3%A5ste-g%C3%B6ras-1.277680">Nu bör det stå klart att något måste göras </a>
</h1>
<h1>
<a href="/omdsv/nyheter/hur-enkelt-%C3%A4r-det-f%C3%B6r-fbi-att-kn%C3%A4cka-en-iphone-utan-apples-hj%C3%A4lp-1.277546">Hur enkelt är det för FBI att knäcka en Iphone utan Apples hjälp?</a>
</h1>
<h1>
<a href="/omdsv/nyheter/1-av-2-vill-l%C3%A5ta-staten-hacka-sig-in-i-datorer-1.277367">Svårt att backa tillbaka från ökad övervakning</a>
</h1>
<h1>Senaste nyheterna</h1>
<h2 class="category">Kommande evenemang</h2>
<h2>Information inför terminsstart</h2>
<h1 class="visuallyhidden">Other languages</h1>
<h1>Other languages</h1>
<h2>
    Information in Chinese and Russian</h2>
<h2>Contact The Administration of Studies</h2>
<h1 class="visuallyhidden">Department of Computer and Systems Sciences</h1>
<h1>
<a href="/en/about/news/improve-your-digital-competences-on-line-with-eskills-match-1.278510">Improve your digital competences on-line with eSkills Match</a>
</h1>
<h1>
<a href="/en/about/news/envisioning-equality-in-computer-science-tomorrow-today-1.272045">Envisioning Equality in Computer Science - Tomorrow Today</a>
</h1>
<h1>
<a href="/en/about/news/egovlab-develops-online-democracy-1.271839">eGovlab develops online democracy</a>
</h1>
<h1>
<a href="/en/about/events/vinnova-and-dsv-invite-you-to-a-seminar-about-horizon-2020-1.266104">Vinnova and DSV invite you to a seminar about Horizon 2020</a>
</h1>
<h1>
<a href="/en/about/news/significant-increase-of-applicants-for-international-programmes-1.265744">Significant increase of applicants for international programmes</a>
</h1>
<h1>News</h1>
<h2>Semester start information</h2>
<h2>Meet our students</h2>
...................................

Obviously, if you want to go deeper you need to add more logic: store every link you find in a structure and keep looping until it is empty. Something like this:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from time import sleep

class Crawl:
    def __init__(self, start_url, allowed, base, select):
        self.start_url = start_url
        self.base = base
        self.allowed_domain = allowed
        self.crawled = set()
        self.select = select
        self.urls = set()

    def start(self):
        r = requests.get(self.start_url)
        soup = BeautifulSoup(r.content, "lxml")
        menu_links = [urljoin(self.base, a["href"]) for a in soup.select(self.select)]
        print(menu_links)
        for lnk in menu_links:
            yield from self.crawl(lnk)

    def filter_urls(self, soup):
        # Keep only links on the allowed domain, resolving relative URLs first.
        page_links = [a["href"] for a in soup.select("a[href]")]
        joined = (urljoin(self.base, lnk) if lnk.startswith("/en/") else lnk for lnk in page_links)
        return set(filter(lambda lnk: self.allowed_domain in lnk, joined))

    def crawl(self, lnk):
        r = requests.get(lnk)
        soup = BeautifulSoup(r.content, "lxml")
        hs = ["h1", "h2", "h3", "h4", "h5", "h6"]
        self.urls.update(self.filter_urls(soup))
        # Keep popping URLs until the pool is empty; newly discovered links
        # are added back in as each page is parsed.
        while self.urls:
            nxt = self.urls.pop()
            if nxt not in self.crawled:
                try:
                    soup = BeautifulSoup(requests.get(nxt).content, "lxml")
                except requests.exceptions.RequestException as e:
                    print(e)
                    self.crawled.add(nxt)
                    continue
                self.urls.update(self.filter_urls(soup) - self.crawled)
                for h in hs:
                    yield soup.find_all(h)
            self.crawled.add(nxt)
            sleep(.1)

This will visit every link whose URL contains dsv.su.se, but be aware that there are a lot of links to scrape, so be prepared to wait quite a while.
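If you do not want to wait for the whole site, one cheap option is to cap how much output you consume: because Crawl.start() is a generator, itertools.islice stops requesting new pages as soon as the limit is reached. A small usage sketch against the class above (the limit of 200 headings is arbitrary):

from itertools import chain, islice

url = "http://dsv.su.se/en"
base = "http://dsv.su.se"
crawler = Crawl(url, "dsv.su.se", base, "#HeadRowMenu a")

# The crawl is lazy, so no further pages are fetched after 200 headings.
for h in islice(chain.from_iterable(crawler.start()), 200):
    print(h)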

Answer 1 (score: -1)

Here is a demo version (untested) that does what you describe. Basically, you add newly discovered URLs to a queue and keep going until every link has been crawled:

import requests
from bs4 import BeautifulSoup
from collections import deque
from urllib.parse import urljoin

seen = set()
queue = deque(['http://dsv.su.se/en/research'])
while queue:
    url = queue.popleft()
    if url not in seen:
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "html.parser")
        # Print the headings found on this page.
        for item in soup.find_all("h1"):
            print(item.contents[0])
        # Queue the links discovered on this page, resolved to absolute URLs
        # and restricted to the same domain.
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http://dsv.su.se"):
                queue.append(link)
        seen.add(url)

I think you need something along these lines.