I have a scraper with the following rules:
rules = (
    Rule(LinkExtractor(allow=('\S+list=\S+'))),
    Rule(LinkExtractor(allow=('\S+list=\S+'))),
    Rule(LinkExtractor(allow=('\S+view=1\S+')), callback='parse_archive'),
)
As you can see, the second and third rules are exactly the same.
What I want to do is tell the spider which links we are interested in by pointing only at specific locations within the page. For convenience, I am including the corresponding XPaths, although I would prefer a solution based on BeautifulSoup syntax.
//*[@id="main_frame"]/tbody/tr[3]/td[2]/table/tbody/tr/td/div/table/tbody/tr/td[1]
//*[@id="main_frame"]/tbody/tr[3]/td[2]/table/tbody/tr/td/div/form/table/tbody/tr[1]
//*[@id="main_frame"]/tbody/tr[3]/td[2]/table/tbody/tr/td/div/form/table/tbody/tr[2]
EDIT:
Let me give an example. Suppose I want to extract five (out of six) links from Scrapy's official page:
Here is my spider. Any ideas?
class dmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["scrapy.org"]
    start_urls = [
        "http://scrapy.org/",
    ]

    rules = (
        Rule(LinkExtractor(allow=('\S+/'), restrict_xpaths=('/html/body/div[1]/div/ul')), callback='first_level'),
    )

    def first_level(self, response):
        taco = dmozItem()
        taco['basic_url'] = response.url
        return taco
Answer 0 (score: 2)
This can be done with the restrict_xpaths argument. See the LxmlLinkExtractor documentation.
EDIT:
You can also pass a list to restrict_xpaths.
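As a rough sketch of that idea, the three original rules could collapse into a single one. The XPaths below are shortened with // from the ones given in the question, so they are assumptions that may need adjusting to the actual page structure:

rules = (
    Rule(
        LinkExtractor(
            allow=('\S+view=1\S+'),
            # restrict_xpaths accepts a single expression or a list;
            # only links found inside the matched regions are extracted.
            restrict_xpaths=[
                '//*[@id="main_frame"]//div/table',
                '//*[@id="main_frame"]//div/form/table',
            ],
        ),
        callback='parse_archive',
    ),
)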
EDIT 2:
A complete example that should work:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class dmozItem(scrapy.Item):
    basic_url = scrapy.Field()


class dmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["scrapy.org"]
    start_urls = [
        "http://scrapy.org/",
    ]

    def clean_url(value):
        return value.replace('/../', '/')

    rules = (
        Rule(
            LinkExtractor(
                allow=('\S+/'),
                restrict_xpaths=(['.//ul[@class="navigation"]/a[1]',
                                  './/ul[@class="navigation"]/a[2]',
                                  './/ul[@class="navigation"]/a[3]',
                                  './/ul[@class="navigation"]/a[4]',
                                  './/ul[@class="navigation"]/a[5]']),
                process_value=clean_url
            ),
            callback='first_level'),
    )

    def first_level(self, response):
        taco = dmozItem()
        taco['basic_url'] = response.url
        return taco
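Since the example is self-contained, it can be tried outside a full Scrapy project with the runspider command; the file name below is just a placeholder for wherever the code is saved:

scrapy runspider dmoz_spider.py -o items.json

Note that the scrapy.contrib import paths come from older Scrapy releases; in current versions the same classes live under scrapy.spiders and scrapy.linkextractors.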