I am trying to create a simple crawler using Scrapy (scrapy.org). In the example spider, URLs matching item.php are allowed. How do I write a rule that allows only URLs that always start with http://example.com/category/ and whose GET parameter page contains some number of digits, with any number of other parameters in a random order?
How can I write such a rule?
A few valid values are:
http://example.com/category/?sort=a-z&page=1
http://example.com/category/?page=1&sort=a-z&cache=1
http://example.com/category/?page=1&sort=a-z#
Here is the code:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/category/']

    rules = (
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
Answer 0 (score: 5)
Test for http://example.com/category/ at the start of the string and for a page parameter with one or more digits in its value:
Rule(LinkExtractor(allow=(r'^http://example.com/category/\?.*?(?=page=\d+)', )), callback='parse_item'),
Demo (using your example URLs):
>>> import re
>>> pattern = re.compile(r'^http://example.com/category/\?.*?(?=page=\d+)')
>>> should_match = [
... 'http://example.com/category/?sort=a-z&page=1',
... 'http://example.com/category/?page=1&sort=a-z&cache=1',
... 'http://example.com/category/?page=1&sort=a-z#'
... ]
>>> for url in should_match:
... print "Matches" if pattern.search(url) else "Doesn't match"
...
Matches
Matches
Matches
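Putting it together, here is a minimal sketch of the question's spider with this rule wired in (assuming a Scrapy release ≥ 1.0, where the imports live under scrapy.spiders and scrapy.linkextractors; parse_item returns a plain dict instead of a bare scrapy.Item, which would otherwise need declared fields):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/category/']

    rules = (
        # Only follow /category/ URLs whose query string carries a numeric "page"
        # parameter, wherever that parameter sits among the others
        Rule(LinkExtractor(allow=(r'^http://example.com/category/\?.*?(?=page=\d+)', )),
             callback='parse_item'),
    )

    def parse_item(self, response):
        # Plain dict output; the xpath/re extraction is taken unchanged from the question
        return {
            'id': response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)'),
            'name': response.xpath('//td[@id="item_name"]/text()').extract(),
            'description': response.xpath('//td[@id="item_description"]/text()').extract(),
        }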
Answer 1 (score: -2)
Try something like this:
import re
p = re.compile(ur'<[^>]+href="((http:\/\/example.com\/category\/)([^"]+))"', re.MULTILINE)
test_str = u"<a class=\"youarehere\" href=\"http://example.com/category/?sort=newest\">newest</a>\n \n<a href=\"http://example.com/category/?sot=frequent\">frequent</a>"
re.findall(p, test_str)
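For reference, a rough sketch of what this pattern yields on the sample markup, adapted only by swapping the Python-2-only ur prefix for a plain raw string so it also runs under Python 3 (note that, unlike the pattern in the answer above, it does not check for a page parameter at all):

import re

# Same pattern as above, just without the Python-2-only "ur" string prefix
p = re.compile(r'<[^>]+href="((http:\/\/example.com\/category\/)([^"]+))"', re.MULTILINE)
test_str = ('<a class="youarehere" href="http://example.com/category/?sort=newest">newest</a>\n \n'
            '<a href="http://example.com/category/?sot=frequent">frequent</a>')

# findall returns one tuple per link: (full URL, base part, query string)
for full_url, base, query in p.findall(test_str):
    print(full_url)

# Expected output:
# http://example.com/category/?sort=newest
# http://example.com/category/?sot=frequent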