Scrapy spider returns only part of the data, while the same URL in scrapy shell returns the full results

Asked: 2019-06-24 08:22:34

Tags: python scrapy scrapy-shell

I am trying to scrape Groupon deals using the following link:

When I run the code from scrapy shell, I see all the deals on the page. For example, titles = response.css('figure.card-ui').css('div.cui-udc-title-with-subtitle ::text').getall() gives me 37 titles.

The shell session gives me:

>>> titles = response.css('figure.card-ui').css('div.cui-udc-title-with-subtitle ::text').getall()
>>> titles = [ title.rstrip().lstrip()  for title in titles ]
>>> len(titles)
37
>>> titles
[u'Le Bar du Normandy - H\xf4tel Normandy', u'Michel Balmet', u'Passion Chocolat', u'Le Caf\xe9 Clairi\xe8re', u'LES CAVES DU LOUVRE', u"L'artiste Restaurant", u'Auberge Le Relais', u'Le Caf\xe9 des Initi\xe9s', u'La Mar\xe9e (75008)', u'Ko\xef', u'Casa Paco (75116)', u'Capitaine Fracasse', u'LePergol\xe8se', u'Wine Tours Paris', u'La Maison Du Rhum', u'Au Port du Salut', u'Grains Nobles', u"L'artiste Restaurant", u'Michel Balmet, 10e', u'Feyrouz C\xf4t\xe9 Mer', u"L'agap\xe9", u"Restaurant Au Bon'art", u'Shibuya Karaok\xe9', u'Eiffel Croisieres', u'Cfv', u'Made In Italy', u'Fuumi Restaurant', u'OfbPontault', u'Le Jackpot', u'La Brasserie Centrale', u'Le cheval blanc', u'LA CANTINE DES TSARS', u'Restaurant Guy Savoy \xe0 la Monnaie de Paris', u'Chez Ma Cousine', u'MAMABALI', u'LE COSMOS', u'Restaurant Le Sancerre']
>>> 
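The whitespace cleanup in the session above can also be written as a small standalone helper. This is only a minimal sketch: `clean_titles` is a hypothetical name (not part of Scrapy), and dropping whitespace-only strings is an extra assumption that the original session does not make.

```python
def clean_titles(raw_titles):
    """Strip surrounding whitespace from each title and drop empty results."""
    cleaned = []
    for title in raw_titles:
        stripped = title.strip()  # same effect as .rstrip().lstrip()
        if stripped:  # assumption: whitespace-only text nodes are not wanted
            cleaned.append(stripped)
    return cleaned


# Illustrative input only, not real scraped output.
sample = ["\n  Michel Balmet  ", "   ", "Passion Chocolat\n"]
print(clean_titles(sample))
```

In the spider, the list returned by `.getall()` would be passed to this helper in place of the list comprehension.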

When I run the same code through a spider, I only get a small subset of the results:

import scrapy


class GrouponSpider(scrapy.Spider):
    name = "deals"

    start_urls = [
            'https://www.groupon.fr/browse/paris?category=bars-et-restaurants&=undefined&gclid=Cj0KCQjwo7foBRD8ARIsAHTy2wm4-T4w6ps1KMDg5eG8S7jDsNco8VxuJIcoQO6OXkSrzQm4TWEe-QkaArFXEALw_wcB&utm_campaign=fr_dt_sea_ggl_txt_naq_sr_cbp_ch1_ybr_k*groupon%2Bparis_m*e_d*Groupon-Paris_g*Paris-Exact_c*96685051824_ap*1t1&utm_medium=cpc&utm_source=google&page0'
    ]

    def parse(self, response):
        # Collect the text of every deal title inside the deal cards.
        titles = response.css('figure.card-ui').css('div.cui-udc-title-with-subtitle ::text').getall()
        titles = [title.strip() for title in titles]
        for title in titles:
            yield {'title': title}

        # Follow the pagination link, if there is one.
        next_page = response.css('a.next::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

In this case I get the following (running with the flags -o items.csv -t csv), which is only a small part of all the results:

$ cat items.csv
    title
    Le Bar du Normandy - Hôtel Normandy
    Michel Balmet
    Passion Chocolat
    Le Café Clairière
    Auberge Le Relais
    La Marée (75008)
    L'artiste Restaurant
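To see exactly which deals the spider missed, the two outputs above can be compared directly. The following is a minimal sketch in plain Python, with both lists copied from the shell session and from items.csv; `missing_titles` is a hypothetical helper name. It only reports the difference between the two runs and does not explain its cause.

```python
# Titles copied from the scrapy shell session (37 entries, one duplicate).
shell_titles = [
    'Le Bar du Normandy - H\xf4tel Normandy', 'Michel Balmet',
    'Passion Chocolat', 'Le Caf\xe9 Clairi\xe8re', 'LES CAVES DU LOUVRE',
    "L'artiste Restaurant", 'Auberge Le Relais', 'Le Caf\xe9 des Initi\xe9s',
    'La Mar\xe9e (75008)', 'Ko\xef', 'Casa Paco (75116)',
    'Capitaine Fracasse', 'LePergol\xe8se', 'Wine Tours Paris',
    'La Maison Du Rhum', 'Au Port du Salut', 'Grains Nobles',
    "L'artiste Restaurant", 'Michel Balmet, 10e', 'Feyrouz C\xf4t\xe9 Mer',
    "L'agap\xe9", "Restaurant Au Bon'art", 'Shibuya Karaok\xe9',
    'Eiffel Croisieres', 'Cfv', 'Made In Italy', 'Fuumi Restaurant',
    'OfbPontault', 'Le Jackpot', 'La Brasserie Centrale', 'Le cheval blanc',
    'LA CANTINE DES TSARS',
    'Restaurant Guy Savoy \xe0 la Monnaie de Paris', 'Chez Ma Cousine',
    'MAMABALI', 'LE COSMOS', 'Restaurant Le Sancerre',
]

# Titles copied from items.csv (header row excluded).
csv_titles = [
    'Le Bar du Normandy - H\xf4tel Normandy', 'Michel Balmet',
    'Passion Chocolat', 'Le Caf\xe9 Clairi\xe8re', 'Auberge Le Relais',
    'La Mar\xe9e (75008)', "L'artiste Restaurant",
]


def missing_titles(expected, actual):
    """Return the entries of `expected` that do not occur in `actual`, in order."""
    seen = set(actual)
    return [title for title in expected if title not in seen]


missing = missing_titles(shell_titles, csv_titles)     # seen in shell only
unexpected = missing_titles(csv_titles, shell_titles)  # seen in CSV only

print(len(shell_titles), len(csv_titles), len(missing))
for title in missing:
    print(title)
```

If `unexpected` is empty, every title the spider exported was also present in the shell response, which suggests the spider received a reduced version of the same page rather than a different page.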

Any ideas on how to get the full results from the spider code?

0 answers:

No answers yet.