Recursively searching a website with httplib2 and BeautifulSoup

Date: 2017-10-02 17:12:25

Tags: python-2.7 beautifulsoup httplib2

I'm using the following to get all external JavaScript references from a web page. How can I modify the code to search not just this one URL, but all the pages of the site?

import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('https://stackoverflow.com')

# Parse only the <script> tags and print every external (http/https) src URL
for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('script')):
    if link.has_key('src'):
        if 'http' in link['src']:
            print link['src']

First attempt at getting it to scrape two pages deep. Any suggestions on how to get it to return only unique URLs? As it is, most of them are duplicates. (Note that on the sites I need to run this against, all internal links contain the word "index".)

import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

site = 'http://www.stackoverflow.com/'
http = httplib2.Http()
status, response = http.request(site)

# First level: follow every internal link (they all contain 'index')
for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    if link.has_key('href'):
        if 'index' in link['href']:
            page = site + link['href']
            status, response = http.request(page)

            # Report external scripts and iframes on this page
            for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('script')):
                if link.has_key('src'):
                    if 'http' in link['src']:
                        print "script" + " " + link['src']
            for iframe in BeautifulSoup(response, parseOnlyThese=SoupStrainer('iframe')):
                print "iframe" + " " + iframe['src']

            # Second level: repeat the same walk one link deeper
            for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
                if link.has_key('href'):
                    if 'index' in link['href']:
                        page = site + link['href']
                        status, response = http.request(page)

                        for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('script')):
                            if link.has_key('src'):
                                if 'http' in link['src']:
                                    print "script" + " " + link['src']
                        for iframe in BeautifulSoup(response, parseOnlyThese=SoupStrainer('iframe')):
                            print "iframe" + " " + iframe['src']

1 Answer:

Answer 0 (score: 0):

Crawling websites is a broad topic: you have to decide how to index content and how to go deeper into the site, and it involves content parsing, which is what your basic crawler or spider is already doing. Writing a superior bot along the lines of the Google Bot is definitely not easy. A professional crawling bot does a lot of work, which may include:

  • Monitoring changes related to the domain to trigger a crawl
  • Scheduling sitemap lookups
  • Fetching web page content (the scope of this question)
  • Fetching the set of links for further crawling
  • Adding a weight or priority to each URL (see the sketch after this list)
  • Monitoring when the site's service goes down
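As a rough illustration of the weighting point above, a crawl frontier can be ordered with Python's heapq module; the depth-based weight function here is a heuristic invented for this sketch, not something from the original answer:

import heapq
from urlparse import urlparse

def weight(url):
    # Hypothetical heuristic: fewer path segments means higher priority
    return len(urlparse(url).path.strip('/').split('/'))

frontier = []  # min-heap of (priority, url) pairs
for url in ['http://example.com/a/b/c', 'http://example.com/', 'http://example.com/a']:
    heapq.heappush(frontier, (weight(url), url))

while frontier:
    priority, url = heapq.heappop(frontier)
    print priority, url  # shallower URLs come out first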

If it is just about crawling a particular site such as Stack Overflow, I have modified the code to crawl recursively; converting it further into a multi-threaded form would be trivial. It uses a Bloom filter to make sure the same page does not have to be crawled again. Let me warn you in advance: there will still be unexpected traps while crawling. Mature crawling software like Scrapy, Nutch, or Heritrix does this much better.

import requests
from bs4 import BeautifulSoup as Soup, SoupStrainer
from bs4.element import Tag
from bloom_filter import BloomFilter
from Queue import Queue
from urlparse import urljoin, urlparse

# Probabilistic record of visited URLs: a false positive can skip a page,
# but no page is deliberately crawled twice
visited = BloomFilter(max_elements=100000, error_rate=0.1)
visitlist = Queue()

def isurlabsolute(url):
    # An absolute URL carries a network location (scheme://host/...)
    return bool(urlparse(url).netloc)

def visit(url):
    print "Visiting %s" % url
    visited.add(url)
    return requests.get(url)

def parsehref(response):
    if response.status_code == 200:
        # Parse only the <a> tags to keep the soup small
        for link in Soup(response.content, 'lxml', parse_only=SoupStrainer('a')):
            if type(link) == Tag and link.has_attr('href'):
                href = link['href']
                if isurlabsolute(href) == False:
                    # Resolve relative links against the page they came from
                    href = urljoin(response.url, href)
                href = str(href)
                if href not in visited:
                    visitlist.put_nowait(href)
                else:
                    print "Already visited %s" % href
    else:
        print "Got issues mate"

if __name__ == '__main__':
    visitlist.put_nowait('http://www.stackoverflow.com/')
    while visitlist.empty() != True:
        url = visitlist.get()
        # The same URL can be queued more than once before its first visit,
        # so re-check the filter before fetching
        if url not in visited:
            resp = visit(url)
            parsehref(resp)
        visitlist.task_done()
    visitlist.join()
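The answer notes that a multi-threaded conversion would be trivial; below is a minimal sketch of what that could look like in place of the single-threaded __main__ block, reusing visitlist, visited, visit, and parsehref from the script above. The worker count and the lock are assumptions added here (the bloom_filter package is not documented as thread-safe):

import threading

NUM_WORKERS = 4  # assumed worker count, not part of the original answer
lock = threading.Lock()  # serialize membership tests on the shared Bloom filter

def worker():
    while True:
        url = visitlist.get()
        try:
            with lock:
                fresh = url not in visited
            if fresh:
                parsehref(visit(url))  # visit() marks the URL as visited
        finally:
            visitlist.task_done()  # exactly once per get()

for _ in range(NUM_WORKERS):
    t = threading.Thread(target=worker)
    t.daemon = True  # workers exit with the main thread
    t.start()

visitlist.put_nowait('http://www.stackoverflow.com/')
visitlist.join()  # returns once every queued URL has been processed

Because each worker enqueues newly found links before calling task_done(), the join() only returns once the frontier is truly exhausted rather than at the first momentary lull.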