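I am trying to scrape Groupon deals using the following link:
When I run the code from the Scrapy shell (scrapy shell),
I see all the deals on the page.
For example, titles = response.css('figure.card-ui').css('div.cui-udc-title-with-subtitle ::text').getall()
gets me 37 titles.
The shell session gives me: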
>>> titles = response.css('figure.card-ui').css('div.cui-udc-title-with-subtitle ::text').getall()
>>> titles = [ title.rstrip().lstrip() for title in titles ]
>>> len(titles)
37
>>> titles
[u'Le Bar du Normandy - H\xf4tel Normandy', u'Michel Balmet', u'Passion Chocolat', u'Le Caf\xe9 Clairi\xe8re', u'LES CAVES DU LOUVRE', u"L'artiste Restaurant", u'Auberge Le Relais', u'Le Caf\xe9 des Initi\xe9s', u'La Mar\xe9e (75008)', u'Ko\xef', u'Casa Paco (75116)', u'Capitaine Fracasse', u'LePergol\xe8se', u'Wine Tours Paris', u'La Maison Du Rhum', u'Au Port du Salut', u'Grains Nobles', u"L'artiste Restaurant", u'Michel Balmet, 10e', u'Feyrouz C\xf4t\xe9 Mer', u"L'agap\xe9", u"Restaurant Au Bon'art", u'Shibuya Karaok\xe9', u'Eiffel Croisieres', u'Cfv', u'Made In Italy', u'Fuumi Restaurant', u'OfbPontault', u'Le Jackpot', u'La Brasserie Centrale', u'Le cheval blanc', u'LA CANTINE DES TSARS', u'Restaurant Guy Savoy \xe0 la Monnaie de Paris', u'Chez Ma Cousine', u'MAMABALI', u'LE COSMOS', u'Restaurant Le Sancerre']
>>>
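The whitespace-stripping step in the session above could be pulled into a small helper so the shell and the spider clean titles the same way. This is only a minimal sketch: `clean_titles` is a hypothetical name, and unlike the list comprehension above it also drops entries that are empty after stripping, which is an assumption about what is wanted.

```python
def clean_titles(raw_titles):
    """Strip surrounding whitespace and drop entries that end up empty.

    `raw_titles` is any iterable of strings, e.g. the result of
    `response.css(...).getall()`.
    """
    stripped = (title.strip() for title in raw_titles)
    # Keep only titles that still contain text after stripping.
    return [title for title in stripped if title]
```

In the shell this would be called as `titles = clean_titles(titles)`.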
When I run the same selectors through a spider, I only get a small fraction of the results:
import scrapy


class GrouponSpider(scrapy.Spider):
    name = "deals"
    start_urls = [
        'https://www.groupon.fr/browse/paris?category=bars-et-restaurants&=undefined&gclid=Cj0KCQjwo7foBRD8ARIsAHTy2wm4-T4w6ps1KMDg5eG8S7jDsNco8VxuJIcoQO6OXkSrzQm4TWEe-QkaArFXEALw_wcB&utm_campaign=fr_dt_sea_ggl_txt_naq_sr_cbp_ch1_ybr_k*groupon%2Bparis_m*e_d*Groupon-Paris_g*Paris-Exact_c*96685051824_ap*1t1&utm_medium=cpc&utm_source=google&page0'
    ]

    def parse(self, response):
        # Same selector chain as in the shell session.
        titles = response.css('figure.card-ui').css('div.cui-udc-title-with-subtitle ::text').getall()
        # strip() is equivalent to rstrip().lstrip().
        titles = [title.strip() for title in titles]
        for title in titles:
            yield {'title': title}
        # Follow the "next" pagination link, if the page has one.
        next_page = response.css('a.next::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
In this case I get the following (run with the flags -o items.csv -t csv),
which is only a small subset of all the results:
$ cat items.csv
title
Le Bar du Normandy - Hôtel Normandy
Michel Balmet
Passion Chocolat
Le Café Clairière
Auberge Le Relais
La Marée (75008)
L'artiste Restaurant
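To see exactly which deals the spider misses, the two title lists can be compared directly. A minimal sketch, assuming both lists are available as plain Python lists; `missing_titles` is a hypothetical helper, not part of Scrapy:

```python
def missing_titles(shell_titles, spider_titles):
    """Return the titles seen in the shell but not in the spider output.

    The order of `shell_titles` is preserved; duplicates in it are kept.
    """
    seen = set(spider_titles)
    return [title for title in shell_titles if title not in seen]
```

For example, calling it with the 37 shell titles and the 7 titles from items.csv would list the entries that never reached the CSV.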
Any ideas on how to get the full set of results from the spider code?