I want to find the URLs of web pages that match a specific regex. I am using the scrapy package in Python. My code looks like this:
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class TestSpider(CrawlSpider):
        name = 'testingcode'
        start_urls = ['http://dinoopnair.blogspot.in/']  # urls from which the spider will start crawling
        rules = [
            # r'page/\d+' : regular expression for http://isbullsh.it/page/X URLs
            Rule(SgmlLinkExtractor(allow=[r'page/\d+']), follow=True),
            # r'\d{4}/\d{2}/\w+' : regular expression for http://isbullsh.it/YYYY/MM/title URLs
            Rule(SgmlLinkExtractor(allow=[r'\d{4}/\d{2}/\w+']), callback='parse_blogpost', follow=True),
        ]

        def parse_blogpost(self, response):
            print response.url
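As a quick sanity check, the two URL patterns from the rules above can be exercised in isolation with the standard `re` module (the sample URLs below are invented for illustration):

```python
import re

# the two patterns used in the Rule definitions above
page_pattern = re.compile(r'page/\d+')
post_pattern = re.compile(r'\d{4}/\d{2}/\w+')

samples = [
    'http://isbullsh.it/page/3',
    'http://dinoopnair.blogspot.in/2014/07/facebook-search-and-elastic-search.html',
    'http://dinoopnair.blogspot.in/about',
]

for url in samples:
    kind = ('pagination' if page_pattern.search(url)
            else 'blog post' if post_pattern.search(url)
            else 'no match')
    print(url, '->', kind)
```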
This works fine. Now I want to get the text of each link as well. For example:

    <a href="http://dinoopnair.blogspot.in/2014/07/facebook-search-and-elastic-search.html">facebook search and elastic search</a>

This is one of the article links that matches our regular expression. I want to get the text "facebook search and elastic search" between the a tags. How can I extract that text from the response argument of the callback function?
Answer 0 (score: 1)
I think this will do what you need:

    class TestSpider(Spider):  # inherit from Spider instead of CrawlSpider
        name = 'testingcode'
        start_urls = ['http://dinoopnair.blogspot.in/']

        def parse(self, response):
            base_selector = response.xpath('//h3[@class="post-title entry-title"]')
            for sel in base_selector:
                link = sel.xpath('./a/@href').extract()
                link_text = sel.xpath('./a/text()').extract()
                # clean the data
                link = link[0] if link else 'n/a'
                link_text = link_text[0].strip() if link_text else 'n/a'
                print link, link_text
EDIT

Generic code, since the user has several start URLs:
    from scrapy.selector import Selector

    # other code here

        def parse(self, response):
            # change the regex accordingly
            links = response.xpath('//a').re(r'href=".*\d{4}/\d{2}/.*')
            for link in links:
                sel = Selector(text='<a ' + link)
                link_text = sel.xpath('//a//text()').extract()
                url = sel.xpath('//a/@href').extract()
                link_text = ' '.join(link_text).strip() if link_text else 'n/a'
                url = url[0] if url else 'n/a'
                print(link_text, url)
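The regex-and-reassembly trick above is fragile, since it rebuilds an `<a>` tag from a matched string. Purely for illustration, the same href/text pairing can be done outside Scrapy with the standard-library `html.parser` (the HTML snippet and class name here are invented for the demo):

```python
import re
from html.parser import HTMLParser

class PostLinkParser(HTMLParser):
    """Collect (href, text) pairs for <a> tags whose href contains YYYY/MM/."""
    def __init__(self):
        super().__init__()
        self._href = None  # href of the <a> currently being read, if it matched
        self._text = []    # text fragments collected inside that <a>
        self.links = []    # final (href, text) results

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href', '')
            if re.search(r'\d{4}/\d{2}/', href):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ' '.join(self._text).strip()))
            self._href = None

html = ('<a href="http://dinoopnair.blogspot.in/2014/07/'
        'facebook-search-and-elastic-search.html">facebook search and elastic search</a>'
        '<a href="http://dinoopnair.blogspot.in/about">About</a>')

parser = PostLinkParser()
parser.feed(html)
print(parser.links)
```

Only the first link survives the filter, since its href contains a `YYYY/MM/` segment; the "About" link is skipped.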