Getting a web bot to correctly crawl all pages of a site

Date: 2015-02-27 03:53:31

Tags: python web-scraping web-crawler beautifulsoup

I am trying to walk through all of a site's web pages and extract every instance of a certain tag/class.

It seems to be pulling information from the same page over and over, but I'm not sure why: len(urls) (the stack of URLs being scraped) rises and falls in a bell-curve shape, which suggests I am at least crawling through links, so I may just be extracting/printing the information incorrectly.

import urllib
import urlparse
import re
from bs4 import BeautifulSoup

url = "http://weedmaps.com"

If I try it with just the base weedmaps.com URL, nothing is printed, but if I start from a page that contains the kind of data I'm looking for... url = "https://weedmaps.com/dispensaries/shakeandbake"... then it does pull information, but it prints the same information over and over.

urls = [url]     # Stack of URLs to scrape
visited = [url]  # Record of scraped URLs
htmltext = urllib.urlopen(urls[0]).read()

# While the stack of URLs is non-empty, keep scraping for links
while len(urls) > 0:
    try:
        htmltext = urllib.urlopen(urls[0]).read()
    # Skip URLs that fail to open
    except:
        print urls[0]

    # Get and print information
    soup = BeautifulSoup(htmltext)
    urls.pop(0)
    info = soup.findAll("div", {"class": "story-heading"})

    print info

    # Number of URLs left in the stack
    print len(urls)

    # Resolve relative links and queue unvisited ones
    for tag in soup.findAll('a', href=True):
        tag['href'] = urlparse.urljoin(url, tag['href'])
        if url in tag['href'] and tag['href'] not in visited:
            urls.append(tag['href'])
            visited.append(tag['href'])
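For comparison, the same breadth-first loop can be written in Python 3 with a deque frontier and a visited set, stripping URL fragments before the dedup check. This is only a sketch: the fetch function is injected so the loop can be exercised without network access, and the example.com pages below are invented for illustration.

```python
from collections import deque
from urllib.parse import urljoin, urldefrag
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    # Collects href values from <a> tags.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start, fetch):
    # Breadth-first crawl: deque as the frontier, set for visited.
    # `fetch` is any callable url -> html, injected for testability.
    queue, visited, order = deque([start]), {start}, []
    while queue:
        page_url = queue.popleft()
        try:
            html = fetch(page_url)
        except OSError:
            continue  # skip pages that fail to load
        order.append(page_url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links, then drop the #fragment so
            # anchor variants of the same page are not re-queued.
            link, _ = urldefrag(urljoin(page_url, href))
            if link.startswith(start) and link not in visited:
                visited.add(link)
                queue.append(link)
    return order

# Tiny in-memory "site" (hypothetical) to demonstrate the traversal.
site = {
    "http://example.com": '<a href="/a">a</a><a href="#intro">intro</a>',
    "http://example.com/a": '<a href="/a#hours">hours</a>',
}
pages = crawl("http://example.com", site.__getitem__)
print(pages)
```

Note that the anchor-only link `#intro` and the `/a#hours` variant are collapsed onto pages already visited, so each page is fetched exactly once.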

1 answer:

Answer 0 (score: 3):

The problem with your current code is that the URLs you put into the queue (urls) point to the same page, just with different anchors (the #fragment part of the URL).

In other words, the tag['href'] not in visited condition does not filter out URLs that point to the same page but differ only in their anchor; it treats them as distinct URLs.
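The fix, in spirit, is to canonicalize each link before the visited check by stripping the fragment. A minimal Python 3 sketch (the anchor names here are made up for illustration):

```python
from urllib.parse import urldefrag, urljoin

def canonicalize(base, href):
    # Resolve relative links, then drop the #fragment so anchor
    # variants of the same page collapse to one canonical URL.
    absolute = urljoin(base, href)
    page, _fragment = urldefrag(absolute)
    return page

base = "https://weedmaps.com/dispensaries/shakeandbake"
links = ["#reviews", "#menu", "/dispensaries/shakeandbake"]
seen = {canonicalize(base, href) for href in links}
print(seen)  # all three collapse to the same page URL
```

Checking `canonicalize(...) not in visited` instead of `tag['href'] not in visited` prevents the same page from being queued once per anchor.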

From what I can see, you are reinventing a web-scraping framework. One already exists that will save you time, keep your scraping code organized and clean, and make it significantly faster than your current solution - Scrapy.

You need a CrawlSpider with rules configured to follow links, for example:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class MachineSpider(CrawlSpider):
    name = 'weedmaps'
    allowed_domains = ['weedmaps.com']
    start_urls = ['https://weedmaps.com/dispensaries/shakeandbake']

    rules = [
        Rule(LinkExtractor(allow=r'/dispensaries/'), callback='parse_hours')
    ]

    def parse_hours(self, response):
        print response.url

        for hours in response.css('span[itemid="#store"] div.row.hours-row div.col-md-9'):
            print hours.xpath('text()').extract()

Instead of printing, your callback should return or yield Item instances, which can later be saved to a file or a database, or processed further in a pipeline.
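The yield-then-pipeline pattern can be illustrated without the framework; this is a schematic stand-in, not Scrapy's actual API, and the URL/hours data below are hypothetical:

```python
def parse_hours(scraped_pages):
    # Stand-in for the spider callback: yield one dict ("item")
    # per dispensary instead of printing.
    for page_url, hours in scraped_pages:
        yield {"url": page_url, "hours": hours}

def store(items, database):
    # Stand-in for an item pipeline: persist each yielded item.
    for item in items:
        database.append(item)
    return database

scraped = [("https://weedmaps.com/dispensaries/shakeandbake", "10am-8pm")]
db = store(parse_hours(scraped), [])
print(db)
```

Because the callback is a generator, items flow to the pipeline one at a time, which is how Scrapy keeps memory usage flat even on large crawls.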