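I am trying to scrape a website that lists different variations of an article (a product), where each variation is its own page.
The code for the radio button that selects a variation looks like this: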
<tr class="pre-owned " ng-click="selectVariation('66c3080d-c1be-43e5-a0b8-abc336cd12a0');" data-index="2">
<td class="selector">
<input type="radio" ng-value="'66c3080d-c1be-43e5-a0b8-abc336cd12a0'" value="66c3080d-c1be-43e5-a0b8-abc336cd12a0" name="3ed43cb0-fbd5-11e4-9e07-879c14e1f343" ng-model="productDetail.variation_selection">
</td>
<td class="year" title="2013">2013</td>
The URL is: https://www.watchmaster.com/de/rolex/gmt-master-ii/116710-ln/HFEYFOPGA5?reference_code=DB34S7NV8N
Every variation has the same URL, but with a unique item ID appended as the reference_code parameter (e.g. WBKUQOEXZU):
https://www.watchmaster.com/de/rolex/gmt-master-ii/116710-ln/HFEYFOPGA5?reference_code=WBKUQOEXZU
<slick ng-cloak ng-if="product.reference_code == 'WBKUQOEXZU'"
How can I scrape all variations and follow the URLs that are created by JavaScript?
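One direction that might work, as a rough sketch only: since the variation URLs differ solely in reference_code, they could be built by hand instead of waiting for JavaScript to create them. The helper below is hypothetical (variation_urls and REFERENCE_CODE_RE are names made up for illustration). It assumes that every variation's code appears in the raw HTML inside an expression like the ng-if attribute shown above, which may not hold if the codes are loaded later via XHR.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: each variation's code shows up in the raw HTML inside an
# Angular expression such as ng-if="product.reference_code == 'WBKUQOEXZU'".
REFERENCE_CODE_RE = re.compile(r"reference_code\s*==\s*'([A-Za-z0-9]+)'")


def variation_urls(page_url, html):
    """Return one URL per reference code found in html, based on page_url."""
    parts = urlsplit(page_url)
    # Keep any other query parameters, drop the current reference_code.
    base_query = [(k, v) for k, v in parse_qsl(parts.query)
                  if k != "reference_code"]
    urls = []
    # dict.fromkeys removes duplicate codes while keeping their order.
    for code in dict.fromkeys(REFERENCE_CODE_RE.findall(html)):
        query = urlencode(base_query + [("reference_code", code)])
        urls.append(urlunsplit(parts._replace(query=query)))
    return urls
```

Inside a spider callback this could be used roughly as `for url in variation_urls(response.url, response.text): yield scrapy.Request(url, callback=self.parse_item)`. Scrapy's default duplicate filter would then skip variation URLs that were already requested.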
My rules are as follows:
allowed_domains = ['www.watchmaster.com']
start_urls = ['https://www.watchmaster.com/de/']
rules = (
    # parse article pages
    Rule(
        LinkExtractor(allow=[r'.*/de/((?!shop).)*/.*/([\s\S]){10}\?reference_code.*$']),
        callback='parse_item'
    ),
    # follow other urls that make sense (basically the listings)
    Rule(
        # LinkExtractor(allow=['.*/de/shop/((?!(/tel:|sugsrc)).)*$']),
        LinkExtractor(allow=[r'.*/de/((?!(/tel:|sugsrc|reference_code|sell|brand|gclid|bracelet_material|dialtype|sort|condition)).)*$']),
        follow=True
    ),
)
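To see what these two patterns do on their own, they can be tried against sample URLs outside of Scrapy. The snippet below is only a sketch: classify, ITEM_RE and LISTING_RE are hypothetical names, and the /de/shop/rolex and /de/sell URLs in the usage note are made-up examples rather than pages confirmed to exist. It uses re.search, on the assumption that this mirrors how LinkExtractor applies its allow patterns.

```python
import re

# The two allow patterns from the rules, compiled as raw strings.
ITEM_RE = re.compile(
    r'.*/de/((?!shop).)*/.*/([\s\S]){10}\?reference_code.*$')
LISTING_RE = re.compile(
    r'.*/de/((?!(/tel:|sugsrc|reference_code|sell|brand|gclid'
    r'|bracelet_material|dialtype|sort|condition)).)*$')


def classify(url):
    """Return which rule would pick up url, checking them in rule order."""
    if ITEM_RE.search(url):
        return "item"
    if LISTING_RE.search(url):
        return "listing"
    return None
```

With the article URL from above, classify returns "item"; a listing-style URL such as https://www.watchmaster.com/de/shop/rolex gives "listing", and https://www.watchmaster.com/de/sell gives None. So the rules do accept variation URLs once such links are present in the HTML; the open problem is that the variation links are never present as plain links for the extractor to find.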