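I'm new to Scrapy, and what I'm trying to do is build a crawler that only follows links found on the given start_urls.
As an example, I just want a crawler to go through the AirBnB listings, with start_urls
set to https://www.airbnb.com/s?location=New+York%2C+NY&checkin=&checkout=&guests=1
Rather than every link on the page, I only want to crawl the links inside the XPath //*[@id="results"]
Currently I'm using the following code, which crawls all links. How can I modify it to crawl only the links inside //*[@id="results"]?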
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import DmozItem


class BSpider(CrawlSpider):
    name = "bt"
    # follow = True
    allowed_domains = ["mydomain.com"]
    start_urls = ["http://myurl.com/path"]
    # An empty allow pattern matches every link, so all links on the page are followed.
    rules = (
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # {parse code}
        pass
Any tips in the right direction would be much appreciated. Thanks!
Answer 0 (score: 8)
You can pass the restrict_xpaths keyword argument to SgmlLinkExtractor. From the docs:
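To make the idea concrete, here is a minimal standard-library sketch of what restricting link extraction to an XPath region means. It is not Scrapy code and not how SgmlLinkExtractor is implemented. The sample markup, the helper name links_inside, and the variable names are all assumptions made for illustration, and the real AirBnB page will look different. In the spider from the question, the equivalent restriction would presumably be expressed by passing the XPath through restrict_xpaths instead of leaving the extractor unrestricted.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page; the real markup will differ.
SAMPLE_PAGE = """
<html>
  <body>
    <div id="nav"><a href="/help">Help</a></div>
    <div id="results">
      <a href="/rooms/1">Room 1</a>
      <p><a href="/rooms/2">Room 2</a></p>
    </div>
    <div id="footer"><a href="/about">About</a></div>
  </body>
</html>
"""


def links_inside(page_source, region_xpath):
    """Return the href of every <a> nested under elements matching region_xpath."""
    root = ET.fromstring(page_source)
    hrefs = []
    # First select the region(s), then look for anchors only inside them.
    for region in root.findall(region_xpath):
        for anchor in region.iter("a"):
            href = anchor.get("href")
            if href is not None:
                hrefs.append(href)
    return hrefs


# Only the links under id="results" are collected; nav and footer links are ignored.
result_links = links_inside(SAMPLE_PAGE, ".//*[@id='results']")
print(result_links)
```

The sketch selects the region first and only then collects anchors, which is why the navigation and footer links never show up in the output.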