Getting links and their text with Scrapy

Asked: 2015-05-07 07:15:52

Tags: python scrapy scrapy-spider

I want to find the URLs on a web page that match a specific regex. I am using the Scrapy package in Python. My code looks like this:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class TestingCodeSpider(CrawlSpider):
    name = 'testingcode'
    start_urls = ['http://dinoopnair.blogspot.in/']  # urls from which the spider will start crawling
    rules = [
        # r'page/\d+' : regular expression for http://isbullsh.it/page/X URLs
        Rule(SgmlLinkExtractor(allow=[r'page/\d+']), follow=True),
        # r'\d{4}/\d{2}/\w+' : regular expression for http://isbullsh.it/YYYY/MM/title URLs
        Rule(SgmlLinkExtractor(allow=[r'\d{4}/\d{2}/\w+']), callback='parse_blogpost', follow=True),
    ]

    def parse_blogpost(self, response):
        print response.url

This works fine. Now I want to get the text of the links as well. For example:

<a href="http://dinoopnair.blogspot.in/2014/07/facebook-search-and-elastic-search.html">facebook search and elastic search</a>

This is one of the post links that matches our regular expression. I want to get the text "facebook search and elastic search" between the `a` tags. How can I extract that text from the `response` argument of the callback function?

1 Answer:

Answer 0 (score: 1)

I think this will do what you need:

from scrapy import Spider

class TestSpider(Spider):  # inherit from Spider instead of CrawlSpider
    name = 'testingcode'
    start_urls = ['http://dinoopnair.blogspot.in/']

    def parse(self, response):
        base_selector = response.xpath('//h3[@class="post-title entry-title"]')
        for sel in base_selector:
            link = sel.xpath('./a/@href').extract()
            link_text = sel.xpath('./a/text()').extract()
            # clean the data: fall back to 'n/a' when nothing matched
            link = link[0] if link else 'n/a'
            link_text = link_text[0].strip() if link_text else 'n/a'
            print link, link_text
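The href/text pairing that the XPath above extracts can also be checked offline with the standard library's `html.parser`, without running a spider. This is a stdlib sketch to illustrate the logic, not part of the answer; the class name `PostLinkParser` is made up:

```python
import re
from html.parser import HTMLParser

class PostLinkParser(HTMLParser):
    """Collects (href, text) pairs for <a> tags whose href matches a pattern."""
    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.links = []
        self._href = None   # href of the currently open matching <a>, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href') or ''
            if self.pattern.search(href):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None

html = ('<a href="http://dinoopnair.blogspot.in/2014/07/'
        'facebook-search-and-elastic-search.html">'
        'facebook search and elastic search</a>')
parser = PostLinkParser(r'\d{4}/\d{2}/\w+')
parser.feed(html)
print(parser.links)
```

This keeps only anchors whose href matches the `YYYY/MM` pattern, mirroring what the spider's second rule selects.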

EDIT

More generic code, since the user has several start URLs:

from scrapy.selector import Selector
# other code here

def parse(self, response):
    # change the regex accordingly
    links = response.xpath('//a').re(r'href=".*\d{4}/\d{2}/.*')
    for link in links:
        sel = Selector(text='<a ' + link)  # rebuild the matched fragment as HTML
        link_text = sel.xpath('//a//text()').extract()
        url = sel.xpath('//a/@href').extract()
        link_text = ' '.join(link_text).strip() if link_text else 'n/a'
        url = url[0] if url else 'n/a'
        print(link_text, url)
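Since this edit effectively regex-matches anchor markup, the same idea can be exercised offline with plain `re` from the standard library. This is a sketch with a made-up helper `post_links`, not code from the answer, and regex over HTML is fragile in general, so Scrapy's selectors remain the robust choice in a real spider:

```python
import re

# Match <a href="...YYYY/MM...">text</a> anchors; re.S lets the
# link text span multiple lines.
ANCHOR_RE = re.compile(r'<a\s+href="([^"]*\d{4}/\d{2}/[^"]*)"[^>]*>(.*?)</a>', re.S)

def post_links(html):
    """Return (url, text) pairs for anchors whose href contains YYYY/MM."""
    return [(url, re.sub(r'<[^>]+>', '', text).strip())
            for url, text in ANCHOR_RE.findall(html)]

html = ('<a href="http://dinoopnair.blogspot.in/2014/07/'
        'facebook-search-and-elastic-search.html">'
        'facebook search and elastic search</a>')
print(post_links(html))
```

The inner `re.sub` strips any nested tags from the link text, playing the same role as the `//a//text()` XPath in the answer.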