Question

I need guidance on developing a scraper.

I need to build a custom scraper that retrieves all products from 3 e-commerce websites.

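I built a PoC scraper with Scrapy; however, this scraper has a flaw:

The scraper has to crawl a given category down to depth level 3 before it can reach and visit the pages I actually need, which sit at depth level 1.

For example, the crawl needs to follow this sequence: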

Start: domain.com
domain.com/category/sub_categry/mini_sub_category
domain.com/product1 and domain.com/product2

The URLs for product1 and product2 only become reachable once depth level 2 has been reached (that is, once the sub_categories have been crawled).

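My question is whether I can customize Scrapy to follow this sequence automatically, or whether I need to build a custom scraper with BeautifulSoup, supply each sub_category by hand, and let bs4 start scraping from there.

Here is my Scrapy code: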

import re

from bs4 import BeautifulSoup
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DomainsSpider(CrawlSpider):
    name = 'domains'
    allowed_domains = ['www.amazon.com']
    start_urls = ['http://www.amazon.com/']

    # follow every link on the domain and run parse_items on each page
    rules = [Rule(LinkExtractor(canonicalize=True, unique=True), follow=True, callback="parse_items")]

    def parse_items(self, response):
        # create the soup for the domain
        soup = BeautifulSoup(response.text, 'html.parser')

        # check if proxy is working: a page without a title is retried
        if not soup.title or not soup.title.string:
            yield Request(url=response.url, callback=self.parse_items, dont_filter=True)
            return

        # extract the title, dropping empty strings
        heading_1_raw = response.selector.xpath('//h1/text()').extract()
        heading_1 = [s.strip() for s in heading_1_raw if s.strip()]

        price_raw = response.selector.xpath('//p[contains(@class, "product-new-price")]//text()').extract()

        product_code_text = soup.find_all(string=re.compile("Cod produs"))

        yield {
            'url': response.url,
            'page_title': soup.title.string,
            # None when the page has no non-empty <h1>
            'h1': heading_1[0] if heading_1 else None,
            'price': price_raw,
            'product_code': product_code_text,
        }

Answer 1

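You can easily do what you need with Scrapy; you just have to give the CrawlSpider a list of rules describing how the crawl should proceed.

Something as simple as this might do the trick: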

rules = [
    Rule(LinkExtractor(allow=['/category/'])),
    Rule(LinkExtractor(allow=['/product']), callback='parse_items')
]

If you have trouble understanding or modifying this code, I recommend reading up on rules and link extractors.
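To make the routing concrete, here is a rough standard-library sketch of how the two rules would sort URLs. It relies on two documented behaviours: `allow` patterns are regular expressions searched against the link's absolute URL, and when several rules match the same link the first one defined is used. The `RULES` list, the `route` helper, and the sample URLs are hypothetical stand-ins for illustration, not part of Scrapy.

```python
import re

# Hypothetical stand-ins for the two Rule objects: (allow pattern, callback name).
# A rule without a callback only follows links; it does not scrape the page.
RULES = [
    (re.compile(r'/category/'), None),
    (re.compile(r'/product'), 'parse_items'),
]


def route(url):
    """Return (rule index, callback) for the first rule whose pattern occurs in url.

    Returns None when no rule matches, meaning the link would not be followed.
    """
    for index, (pattern, callback) in enumerate(RULES):
        if pattern.search(url):
            return index, callback
    return None


# Assumed example URLs, modelled on the sequence in the question.
print(route('http://domain.com/category/sub_category/mini_sub_category'))  # (0, None)
print(route('http://domain.com/product1'))                                 # (1, 'parse_items')
print(route('http://domain.com/about'))                                    # None
# A category URL that also contains "/product" is claimed by the first rule.
print(route('http://domain.com/category/products'))                        # (0, None)
```

Category pages are therefore only followed, product pages are handed to `parse_items`, and everything else is ignored. If the real sites use different URL schemes, the two patterns are what needs adjusting.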

Also, you don't need BeautifulSoup in your spider; the built-in parsel selectors can extract any data you want.

Scrapy or BeautifulSoup for a custom scraper?
