I am trying to crawl multiple URLs within the same domain. I join the extracted URLs into a single string, then want to search that string with a regular expression to find the links. But re.match() always returns None, even though I tested my regex separately and it works. Here is my code:
# -*- coding: UTF-8 -*-
import scrapy
import codecs
import re
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import Request
from scrapy.selector import HtmlXPathSelector
from hurriyet.items import HurriyetItem

class hurriyet_spider(CrawlSpider):
    name = 'hurriyet'
    allowed_domains = ['hurriyet.com.tr']
    start_urls = ['http://www.hurriyet.com.tr/gundem/']
    rules = (Rule(SgmlLinkExtractor(allow=('\/gundem(\/\S*)?.asp$')), 'parse', follow=True),)

    def parse(self, response):
        image = HurriyetItem()
        text = response.xpath("//a/@href").extract()
        print text
        urls = ''.join(text)
        page_links = re.match("(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'\".,<>?«»“”‘’]))", urls, re.M)
        image['title'] = response.xpath("//h1[@class = 'title selectionShareable'] | //h1[@itemprop = 'name']/text()").extract()
        image['body'] = response.xpath("//div[@class = 'detailSpot']").extract()
        image['body2'] = response.xpath("//div[@class = 'ctx_content'] ").extract()
        print page_links
        return image, text
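A likely explanation for the None result (an assumption, since the joined string isn't shown): re.match() only matches at the very beginning of the string, so unless `urls` happens to start with a link, it returns None. re.search() scans the whole string, and re.findall() returns every match. A minimal sketch with a deliberately simplified URL pattern, not the full regex above:

```python
import re

# Simplified illustrative pattern, standing in for the long URL regex.
pattern = r"https?://\S+"
text = "see http://www.hurriyet.com.tr/gundem/ for news"

# re.match anchors at position 0, so it fails here: the string
# does not begin with a URL.
print(re.match(pattern, text))  # None

# re.search scans the entire string and finds the first URL.
print(re.search(pattern, text).group())  # http://www.hurriyet.com.tr/gundem/

# re.findall collects all matches at once.
print(re.findall(pattern, text))  # ['http://www.hurriyet.com.tr/gundem/']
```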
Answer 0 (score: 0)
There is no need to use the re module here; Scrapy selectors have a built in feature for regex filtering:
def parse(self, response):
    ...
    page_links = response.xpath("//a/@href").re('your_regex_expression')
    ...
That said, I suggest you try this approach in the Scrapy shell first to make sure your regex actually works, because I wouldn't expect anyone to debug a mile-long regular expression by hand: it is basically a write-only language :)
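In the same spirit, the pattern can be sanity-checked outside the spider against a few sample href values before wiring it into .re(). A stdlib-only sketch; the href strings and the short pattern below are made-up examples, not the asker's real data or regex:

```python
import re

# Hypothetical href values, like those extracted with //a/@href.
hrefs = [
    "/gundem/haber.asp",
    "http://www.hurriyet.com.tr/gundem/detay.asp",
    "mailto:someone@example.com",
]

# A deliberately small pattern standing in for the real one.
link_re = re.compile(r"\S*/gundem/\S*\.asp$")

# Keep only the hrefs the pattern accepts.
matches = [h for h in hrefs if link_re.search(h)]
print(matches)  # ['/gundem/haber.asp', 'http://www.hurriyet.com.tr/gundem/detay.asp']
```

Once the pattern behaves as expected on sample data, the same expression can be passed to `.re('...')` in the spider.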