Using Scrapy to crawl websites and only scrape pages that contain a keyword

Asked: 2015-07-01 09:18:31

Tags: python web-scraping web-crawler scrapy

I am trying to crawl various websites looking for particular keywords of interest and only scraping those pages. I have written the script to run as a standalone Python script rather than with the traditional Scrapy project structure (following this example), using the CrawlSpider class. The idea is that, starting from a given home page, the spider will crawl pages within that domain and only scrape links from the pages that contain the keyword. When I find a page containing the keyword I also try to save a copy of that page. A previous version of this question concerned a syntax error (see the comments below, and thanks to @tegancp for helping me clear that up), but now, although my code runs, I still cannot restrict the link scraping to the pages of interest as intended.

I think I want to either i) remove the call to LinkExtractor in __init__, or ii) keep the call to LinkExtractor in __init__ but base the rule on what I find when I actually visit a page rather than on some attribute of the URL. I can't do i) because the CrawlSpider class requires a rule, and I can't do ii) because LinkExtractor doesn't have a process_links option the way the old SgmlLinkExtractor (which seems to be deprecated) did. I'm new to Scrapy, so I'm wondering whether my only option is to write my own LinkExtractor?
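In case it clarifies what I mean, this is roughly the kind of custom LinkExtractor I have in mind: a subclass that only returns links when the response body contains the keyword (an untested sketch; KeywordLinkExtractor and the hard-coded 'keyword' string are placeholders):

from scrapy.contrib.linkextractors import LinkExtractor

class KeywordLinkExtractor(LinkExtractor):
    # only extract links from pages whose body contains the keyword;
    # otherwise behave as if the page had no links, so the Rule follows nothing from it
    def extract_links(self, response):
        if 'keyword' not in response.body:
            return []
        return super(KeywordLinkExtractor, self).extract_links(response)

The Rule in __init__ would then become something like
self.rules = [Rule(KeywordLinkExtractor(allow_domains=(sys.argv[2],)), follow=True, callback='parse_links')],
but I don't know whether subclassing LinkExtractor like this is safe or whether there is a more idiomatic way. For reference, here is the full script as it currently stands: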

import sys

from scrapy.crawler import Crawler
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose, TakeFirst
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy import log, signals, Spider, Item, Field
from scrapy.settings import Settings
from twisted.internet import reactor


# define an item class
class GenItem(Item):
    url = Field()

# define a spider
class GenSpider(CrawlSpider):
    name = "genspider3"

    # requires 'start_url', 'allowed_domains' and 'folderpath' to be passed as string arguments IN THIS PARTICULAR ORDER!!!
    def __init__(self):

        self.start_urls = [sys.argv[1]]
        self.allowed_domains = [sys.argv[2]]
        self.folder = sys.argv[3]
        self.writefile1 = self.folder + 'hotlinks.txt'
        self.writefile2 = self.folder + 'pages.txt'

        self.rules = [Rule(LinkExtractor(allow_domains=(sys.argv[2],)), follow=True, callback='parse_links')]
        super(GenSpider, self).__init__()

    def parse_start_url(self, response):
        # process the start_url page itself with parse_links and return its items
        return self.parse_links(response)

    def parse_links(self, response):
        # if this page contains a word of interest, save the HTML to file and crawl the links on this page
        theHTML = response.body
        if 'keyword' in theHTML:
            with open(self.writefile2, 'a+') as f2:
                f2.write(theHTML + '\n')
            with open(self.writefile1, 'a+') as f1:
                f1.write(response.url + '\n')

            for link in LinkExtractor(allow_domains=(sys.argv[2],)).extract_links(response):
                linkitem = GenItem()
                linkitem['url'] = link.url
                log.msg(link.url)
                with open(self.writefile1, 'a+') as f1:
                    f1.write(link.url + '\n')
                # yield (rather than return) so every extracted link is emitted, not just the first one
                yield linkitem



# callback fired when the spider is closed
def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()  # collect/log stats?

    # stop the reactor
    reactor.stop()


# instantiate settings and provide a custom configuration
settings = Settings()
#settings.set('DEPTH_LIMIT', 2)
settings.set('DOWNLOAD_DELAY', 0.25)

# instantiate a crawler passing in settings
crawler = Crawler(settings)

# instantiate a spider
spider = GenSpider()

# configure signals
crawler.signals.connect(callback, signal=signals.spider_closed)

# configure and start the crawler
crawler.configure()
crawler.crawl(spider)
crawler.start()

# start logging
log.start(loglevel=log.DEBUG)

# start the reactor (blocks execution)
reactor.run()     
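For completeness, I invoke the script along these lines (the file name and paths are placeholders; the three positional arguments are start_url, allowed_domain and folderpath, in that order, as read from sys.argv above):

python genspider3.py http://www.example.com example.com /path/to/output/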

0 Answers:

No answers yet.