Following radio buttons with Scrapy

Time: 2018-11-30 22:28:46

Tags: scrapy

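I am trying to scrape a website that lists several variations of each article, where every variation is a separate page.

The code for the radio button that selects a variation looks like this: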

<tr class="pre-owned " ng-click="selectVariation(&#039;66c3080d-c1be-43e5-a0b8-abc336cd12a0&#039;);" data-index="2">
        <td class="selector">
            <input type="radio" ng-value="'66c3080d-c1be-43e5-a0b8-abc336cd12a0'" value="66c3080d-c1be-43e5-a0b8-abc336cd12a0" name="3ed43cb0-fbd5-11e4-9e07-879c14e1f343" ng-model="productDetail.variation_selection">
        </td>
        <td class="year" title="2013">2013</td>

The URL is: https://www.watchmaster.com/de/rolex/gmt-master-ii/116710-ln/HFEYFOPGA5?reference_code=DB34S7NV8N

Each variation has the same URL, but with its own unique item ID in the reference_code parameter (e.g. WBKUQOEXZU):

https://www.watchmaster.com/de/rolex/gmt-master-ii/116710-ln/HFEYFOPGA5?reference_code=WBKUQOEXZU

<slick  ng-cloak                                     ng-if="product.reference_code == 'WBKUQOEXZU'"
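Since the variations differ only in the reference_code query parameter, one possible approach is to collect the codes from the page source and build the variation URLs directly. The following is a minimal sketch, not a confirmed solution: it assumes that every variation's code appears in the raw HTML inside an expression such as `product.reference_code == '...'` (as in the `slick` snippet above), which has not been verified for the whole site. The helper names `extract_reference_codes` and `variation_url` are hypothetical.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: codes are embedded in the HTML as
#   ng-if="product.reference_code == 'WBKUQOEXZU'"
REFERENCE_CODE_RE = re.compile(r"product\.reference_code\s*==\s*'([^']+)'")


def extract_reference_codes(html):
    """Return the unique reference codes found in html, in order of appearance."""
    seen = []
    for code in REFERENCE_CODE_RE.findall(html):
        if code not in seen:
            seen.append(code)
    return seen


def variation_url(url, reference_code):
    """Return url with its reference_code query parameter replaced."""
    parts = urlsplit(url)
    # Keep any other query parameters, drop the old reference_code.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "reference_code"]
    query.append(("reference_code", reference_code))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Sample data taken from the question.
sample_html = '''<slick ng-cloak ng-if="product.reference_code == 'WBKUQOEXZU'">'''
base_url = (
    "https://www.watchmaster.com/de/rolex/gmt-master-ii/116710-ln/"
    "HFEYFOPGA5?reference_code=DB34S7NV8N"
)
urls = [variation_url(base_url, code) for code in extract_reference_codes(sample_html)]
print(urls)
```

Inside a spider callback, each resulting URL could then be scheduled with `scrapy.Request(url, callback=self.parse_item)`, with `response.text` in place of `sample_html`. If the codes turn out not to be present in the downloaded HTML, this approach would not work as written.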

How can I scrape all variations and follow the URLs that are created by JavaScript?

My rules look like this:

allowed_domains = ['www.watchmaster.com']
start_urls = ['https://www.watchmaster.com/de/']

rules = (

    # parse article pages
    Rule(
        LinkExtractor(allow=[r'.*/de/((?!shop).)*/.*/([\s\S]){10}\?reference_code.*$']),
        callback='parse_item'
    ),

    # follow other urls that make sense (basically the listings)
    Rule(
        # LinkExtractor(allow=['.*/de/shop/((?!(/tel:|sugsrc)).)*$']), 
        LinkExtractor(allow=['.*/de/((?!(/tel:|sugsrc|reference_code|sell|brand|gclid|bracelet_material|dialtype|sort|condition)).)*$']), 
        follow=True
    ),
)
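To check whether the patterns themselves are the obstacle, the two `allow` expressions can be tried against the URLs from the question using plain `re`. This is only a sketch: it assumes the patterns are applied as a regex search against the full URL, and the labels "article" and "listing" are made up for illustration.

```python
import re

# Patterns copied from the rules above.
ARTICLE_PATTERN = r'.*/de/((?!shop).)*/.*/([\s\S]){10}\?reference_code.*$'
LISTING_PATTERN = (
    r'.*/de/((?!(/tel:|sugsrc|reference_code|sell|brand|gclid|'
    r'bracelet_material|dialtype|sort|condition)).)*$'
)


def matching_rules(url):
    """Return the labels of the patterns that match url."""
    labels = []
    if re.search(ARTICLE_PATTERN, url):
        labels.append("article")
    if re.search(LISTING_PATTERN, url):
        labels.append("listing")
    return labels


# URLs taken from the question.
urls = [
    "https://www.watchmaster.com/de/",
    "https://www.watchmaster.com/de/rolex/gmt-master-ii/116710-ln/HFEYFOPGA5?reference_code=DB34S7NV8N",
    "https://www.watchmaster.com/de/rolex/gmt-master-ii/116710-ln/HFEYFOPGA5?reference_code=WBKUQOEXZU",
]
for url in urls:
    print(matching_rules(url), url)
```

Both variation URLs match the article pattern, so the rules would accept them if they were ever extracted. By default, `LinkExtractor` only reads `href` attributes of `a` and `area` tags, so a URL that exists only after the `ng-click` handler runs in a browser never reaches the rules.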

0 answers:

No answers yet