Scraping dynamic content with Scrapy

Date: 2015-06-03 08:20:38

Tags: python web-scraping web-crawler scrapy

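I am trying to get the latest reviews from the Google Play store, following the approach described in another question on the same topic.

The method from the answer to that question works in the scrapy shell, but when I try it in my spider it is completely ignored.

Code snippet: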

import re
import sys
import time
import urllib
import urlparse

from scrapy import Spider
from scrapy.spider import BaseSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor

from play.items import PlayApp

class PlaySpider(CrawlSpider):
    name = "play"
    allowed_domains = ["play.google.com"]
    start_urls = [
            "https://play.google.com/store/apps"
        ]

    rules = (
        Rule(LxmlLinkExtractor(allow=('/store/apps$', )), callback='parseCategory',follow=True),
    )

    def parseCategory(self, response):
        """
            Gets the categories from the store home page and calls parseLinks for each category.
        """
        #something here......
        yield Request(categoryapps, callback=self.parseLinks)

    def parseLinks(self, response):

        '''
        Gets all the links from the category page and then
        passes each individual link to the parseApp function.
        '''    
        #something here

        yield Request(link, callback=self.parseApp)

    def parseApp(self, response):

        '''
        parses apps page to get info about the app
        '''

        #application page parsing ......        

        frmdata = {"id": "com.supercell.boombeach", "reviewType": '0', "reviewSortOrder": '0', "pageNum":'0'}
        url = "https://play.google.com/store/getreviews"
        yield FormRequest(url, callback=self.parse_data, formdata=frmdata)

        yield app

    def parse_data(self, response):
        # do stuff with data...
        print '\n\n---------------I am here------------------\n\n'

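The function parse_data is never called. I asked about this on the #scrapy IRC channel and in a few other places, but got no help. Please help.

Here is the DEBUG output on the terminal: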

DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=isoft.studios.ncert.ncertbooks)
2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=af.hindi.stories.booktwo)
2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=com.frozenex.latestnewsms)
2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=com.aqua.apps.english.hindi.dictionary)
2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=com.merriamwebster)
2015-06-03 13:56:08+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=an.HindiTranslate)

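So the POST request is indeed being sent, but the callback method is not called.

1 answer:

Answer 0 (score: 1)

It looks like you have not changed the id in the form data.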
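
For the id to differ per request, it has to be taken from each app's details link. The following is a minimal sketch of that step, assuming links shaped like the `/store/apps/details?id=...` URLs in the log above. The helper names `extract_app_id` and `build_review_form` are hypothetical, and the code targets Python 3 (on Python 2 the same functions live in the `urlparse` module). Note that `str.strip` removes a set of characters from both ends of a string rather than a prefix, so it is not a safe way to cut the id out of the link.

```python
from urllib.parse import parse_qs, urlparse


def extract_app_id(href):
    """Return the value of the `id` query parameter, or None if it is absent."""
    query = parse_qs(urlparse(href).query)
    values = query.get("id")
    return values[0] if values else None


def build_review_form(app_id, page=0):
    """Build the form fields used in the question for one app and one page.

    The field names come from the question; treating `pageNum` as a
    zero-based page index is an assumption.
    """
    return {
        "id": app_id,
        "reviewType": "0",
        "reviewSortOrder": "0",
        "pageNum": str(page),  # form values are sent as strings
    }


href = "/store/apps/details?id=com.merriamwebster"
print(extract_app_id(href))
print(build_review_form(extract_app_id(href)))
```

In a spider, the resulting dict would be passed as `formdata` to `FormRequest`.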

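import json
import re

from scrapy.http import FormRequest
from scrapy.selector import Selector

def parseApp(self, response):
    # Collect the unique app links found on the page.
    apps = set(response.xpath('//a[@class="card-click-target"]/@href').extract())
    url = "https://play.google.com/store/getreviews"
    for app in apps:
        # str.strip() removes a set of characters, not a prefix, so split on "id=" instead.
        _id = app.split('id=', 1)[-1]
        form_data = {"id": _id, "reviewType": '0', "reviewSortOrder": '0', "pageNum": '0'}
        # Throttle with the DOWNLOAD_DELAY setting; sleep() would block Scrapy's event loop.
        yield FormRequest(url=url, formdata=form_data, callback=self.parse_data)

def parse_data(self, response):
    # The method name must match the callback passed to FormRequest above.
    # response.body is a byte string; on Python 3 use response.text instead.
    response_data = re.findall(r"\[\[.*", response.body)
    if response_data:
        try:
            text = json.loads(response_data[0] + ']')
            sell = Selector(text=text[0][2])
        except (ValueError, IndexError, TypeError):
            return
        # do whatever you want to extract using sell.xpath('YOUR_XPATH_HERE')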
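
The unwrapping step in the callback can be tried outside Scrapy. The sketch below mirrors the answer's logic: find the text starting at `[[`, append the closing `]`, parse it as JSON, and take element `[0][2]`. The sample body is made up to match the shape that logic implies and is not a captured response; `extract_review_html` is a hypothetical helper name.

```python
import json
import re


def extract_review_html(body):
    """Return the HTML fragment embedded in the response text, or None."""
    match = re.search(r"\[\[.*", body)  # "." stops at the first newline
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0) + "]")
        return payload[0][2]
    except (ValueError, IndexError, TypeError):
        # Not valid JSON, or not the nested-list shape assumed here.
        return None


# Hypothetical body: a junk prefix, then a JSON array whose final "]" sits on its own line.
SAMPLE_BODY = ")]}'\n\n" + '[["ecr",1,"<div class=\\"single-review\\">Amazing game</div>",0]\n]'

print(extract_review_html(SAMPLE_BODY))
```

Returning `None` on any parse failure keeps the callback from crashing on an unexpected body, at the cost of hiding the reason; logging the exception would be a reasonable addition in a real spider.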

After cleaning the data, a sample review looks something like this:

<div class="single-review">
    <a href="/store/people/details?id=106726831005267540508">
        <img class="author-image" alt="Lorence Gerona avatar image" src="https://lh3.googleusercontent.com/uFp_tsTJboUY7kue5XAsGA=w48-c-h48">
    </a>
    <div class="review-header" data-expand-target="" data-reviewid="gp:AOqpTOHnsExa_P6JFRJD6HF5h71fpY91tNaEODjtfiTu-zPFki9ZnYsNp1HEcGFpGEfu9xqwJL_j-03Tx0e9lw">
        <div class="review-info">
            <span class="author-name">
                <a href="/store/people/details?id=106726831005267540508">Lorence Gerona</a>
            </span>
            <span class="review-date">3 June 2015</span>
            <a class="reviews-permalink" href="/store/apps/details?id=com.supercell.boombeach&amp;reviewId=Z3A6QU9xcFRPSG5zRXhhX1A2SkZSSkQ2SEY1aDcxZnBZOTF0TmFFT0RqdGZpVHUtelBGa2k5Wm5Zc05wMUhFY0dGcEdFZnU5eHF3Skxfai0wM1R4MGU5bHc" title="Link to this review"></a> <div class="review-source" style="display:none">

        </div>
        <div class="review-info-star-rating">
            <div class="tiny-star star-rating-non-editable-container" aria-label="Rated 5 stars out of five stars">
                <div class="current-rating" style="width: 100%;">

                </div>
            </div>
        </div>
    </div>
    <div class="rate-review-wrapper">
        <div class="play-button icon-button small rate-review" title="Spam" data-rating="SPAM">
            <div class="icon spam-flag"></div>
        </div>
        <div class="play-button icon-button small rate-review" title="Helpful" data-rating="HELPFUL">
            <div class="icon thumbs-up"></div>
        </div>
        <div class="play-button icon-button small rate-review" title="Unhelpful" data-rating="UNHELPFUL"> <div class="icon thumbs-down"></div>
    </div>
</div>
</div>
<div class="review-body">
<span class="review-title">Team BOOM BEACH</span>
Amazing game I can defeat hammerman
<div class="review-link" style="display:none">
    <a class="id-no-nav play-button tiny" href="#" target="_blank">Full Review</a>
</div>
</div>
</div>
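
Fields can then be pulled out of markup like the sample above. Inside a spider, Scrapy's `Selector` with XPath would be the natural tool; the sketch below uses only the Python 3 standard library so that it runs on its own. The class names are taken from the sample, while `ReviewFieldParser` and the trimmed `SAMPLE` string are illustrative, and the parser assumes reasonably well-nested HTML.

```python
from html.parser import HTMLParser


class ReviewFieldParser(HTMLParser):
    """Collect the text inside elements whose class is one of WANTED."""

    WANTED = {"author-name", "review-date", "review-title"}
    VOID = {"img", "br", "hr", "input", "meta", "link"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._open = []  # one entry per open element: a field name or None

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        self._open.append(next((c for c in classes if c in self.WANTED), None))

    def handle_endtag(self, tag):
        if tag not in self.VOID and self._open:
            self._open.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Credit the text to the nearest enclosing wanted element, if any.
        for name in reversed(self._open):
            if name is not None:
                self.fields[name] = (self.fields.get(name, "") + " " + text).strip()
                break


# Trimmed from the sample review above.
SAMPLE = """
<div class="single-review">
  <a href="/store/people/details?id=106726831005267540508">
    <img class="author-image" alt="Lorence Gerona avatar image" src="https://lh3.googleusercontent.com/uFp_tsTJboUY7kue5XAsGA=w48-c-h48">
  </a>
  <span class="author-name">
    <a href="/store/people/details?id=106726831005267540508">Lorence Gerona</a>
  </span>
  <span class="review-date">3 June 2015</span>
  <div class="review-body">
    <span class="review-title">Team BOOM BEACH</span>
    Amazing game I can defeat hammerman
  </div>
</div>
"""

parser = ReviewFieldParser()
parser.feed(SAMPLE)
print(parser.fields)
```

The review text itself is skipped here because it sits directly inside `review-body` rather than in one of the wanted elements; adding that class to `WANTED` would capture it together with the title.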