Firstcry.com Scraper Question

Date: 2015-07-09 12:45:52

Tags: ajax web-scraping xmlhttprequest scrapy scrapy-spider

I am trying to scrape the following website: www.firstcry.com. The site uses AJAX (in the form of XHR) to display its search results.

Now, if you look at my code, the jsonresponse variable contains the JSON output of the website. When I try to print it, it contains many \ (backslashes).

If you look at my code just below the jsonresponse variable, you will see that I have commented out several lines. Those were my attempts (tried after reading several similar questions here on Stack Overflow) to remove all the backslashes, as well as the u' prefixes that also appear there.

However, after all those attempts, I was unable to remove all of the backslashes and u' prefixes.

If I don't remove them all, I cannot access jsonresponse by its keys, so removing them is absolutely necessary.

Please help me solve this problem. It would be better if you could provide code specific to my case (problem) rather than generic code!

My code is here:

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess, CrawlerRunner
import scrapy
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings
import datetime
from multiprocessing import Process, Queue
import os
from scrapy.http import Request
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.signalmanager import SignalManager
import json , simplejson , ujson

#query=raw_input("Enter a product to search for= ")
query='bag'
query1=query.replace(" ", "+")  


class DmozItem(scrapy.Item):

    productname = scrapy.Field()
    product_link = scrapy.Field()
    current_price = scrapy.Field()
    mrp = scrapy.Field()
    offer = scrapy.Field()
    imageurl = scrapy.Field()
    outofstock_status = scrapy.Field()

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["http://www.firstcry.com"]


    def start_requests(self):

        task_urls = [
        ]
        i=1
        for i in range(1,2):
            temp = "http://www.firstcry.com/svcs/search.svc/GetSearchPagingProducts_new?PageNo=" + str(i) + "&PageSize=20&SortExpression=Relevance&SubCatId=&BrandId=&Price=&OUTOFSTOCK=&DISCOUNT=&Q=" + query1 + "&rating="
            task_urls.append(temp)
            i=i+1

        start_urls = (task_urls)
        p=len(task_urls)
        return [ Request(url = start_url) for start_url in start_urls ]


    def parse(self, response):
        print response

        items = []
        jsonresponse = dict(ujson.loads(response.body_as_unicode()))
#       jsonresponse = jsonresponse.replace("\\","")
#       jsonresponse = jsonresponse.decode('string_escape')
#       jsonresponse = ("%r" % json.loads(response.body_as_unicode()))
#       d= jsonresponse.json()
        #jsonresponse = jsonresponse.strip("/")
#       print jsonresponse
#       print d
#       print json.dumps("%r" % jsonresponse, indent=4, sort_keys=True)
#       a = simplejson.dumps(simplejson.loads(response.body_as_unicode()).replace("u\'","\'"), indent=4, sort_keys=True)
        #a= json.dumps(json.JSONDecoder().decode(jsonresponse))
        #a = ujson.dumps((ujson.loads(response.body_as_unicode())) , indent=4 )
        a=json.dumps(jsonresponse, indent=4)
        a=a.decode('string_escape')
        a=(a.decode('string_escape'))
#       a.gsub('\\', '')
        #a = a.strip('/')
        #print (jsonresponse)
        print a
        #print "%r" % a
#       print "%r" % json.loads(response.body_as_unicode())

        p=(jsonresponse["hits"])["hit"]
#       print p
#       raw_input()
        for x in p:
            item = DmozItem()
            item['productname'] = str(x['title'])
            item['product_link'] = "http://www.yepme.com/Deals1.aspx?CampId="+str(x["uniqueId"])
            item['current_price']='Rs. ' + str(x["price"])

            try:            
                p=x["marketprice"]
                item['mrp'] = 'Rs. ' + str(p)

            except:
                item['mrp'] = item['current_price']

            try:            
                item['offer'] = str(x["promotionalMsg"])
            except:
                item['offer'] = str('No additional offer available')

            item['imageurl'] = "http://staticaky.yepme.com/newcampaign/"+str(x["uniqueId"])[:-1]+"/"+str(x["smallimage"])
            item['outofstock_status'] = str('In Stock')
            items.append(item)

        print (items)

spider1 = DmozSpider()
settings = Settings()
settings.set("PROJECT", "dmoz")
settings.set("CONCURRENT_REQUESTS" , 100)
settings.set( "DEPTH_PRIORITY" , 1)
settings.set("SCHEDULER_DISK_QUEUE" , "scrapy.squeues.PickleFifoDiskQueue")
settings.set( "SCHEDULER_MEMORY_QUEUE" , "scrapy.squeues.FifoMemoryQueue")
crawler = CrawlerProcess(settings)
crawler.crawl(spider1)
crawler.start()

1 Answer:

Answer 0 (score: 2):

No need to complicate things. Instead of using ujson on response.body_as_unicode() and then converting the result to a dict, just use the regular json module on response.body:

$ scrapy shell "http://www.firstcry.com/svcs/search.svc/GetSearchPagingProducts_new?PageNo=1&PageSize=20&SortExpression=Relevance&SubCatId=&BrandId=&Price=&OUTOFSTOCK=&DISCOUNT=&Q=bag&rating="
...
>>> import json
>>> jsonresponse = json.loads(response.body)
>>> jsonresponse.keys()
[u'ProductResponse']

This worked fine in my example. It looks like you went a bit deep into "hack around until it works" mode ;)

I will note that this line...

p=(jsonresponse["hits"])["hit"]

...will not work in your code. The only key available in jsonresponse after parsing the JSON is "ProductResponse". That key contains another JSON object (encoded as a string), which you can then access like this:

>>> product_response = json.loads(jsonresponse['ProductResponse'])
>>> product_response['hits']['hit']
[{u'fields': {u'_score': u'56.258633',
    u'bname': u'My Milestones',
    u'brandid': u'450',
...
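
This nested encoding, incidentally, is where the backslashes in your output come from: the value of ProductResponse is a JSON document that was serialized into a string and embedded inside the outer JSON, so its quotes get escaped. A minimal illustration with made-up data mirroring the hits/hit/fields structure above:

import json

# The inner object was serialized to a string before being embedded,
# so its quotes show up as \" inside the outer document (made-up data).
raw = '{"ProductResponse": "{\\"hits\\": {\\"hit\\": [{\\"fields\\": {\\"bname\\": \\"My Milestones\\"}}]}}"}'

outer = json.loads(raw)
print(outer["ProductResponse"])   # still a string full of escaped quotes

# A second json.loads() turns the embedded string into a real dict,
# after which normal key access works -- no backslash-stripping needed.
inner = json.loads(outer["ProductResponse"])
print(inner["hits"]["hit"][0]["fields"]["bname"])   # My Milestones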

I think this will give you what you were hoping to get in the p variable.
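
Putting it all together, here is a minimal sketch of how your parse() could look with this fix applied (the hits/hit/fields structure is taken from the shell output above; which fields you copy into the item is up to you):

import json

def parse(self, response):
    # Decode the response body once; no ujson or string_escape tricks needed.
    jsonresponse = json.loads(response.body)

    # 'ProductResponse' holds a JSON *string*, so it needs a second decode.
    product_response = json.loads(jsonresponse['ProductResponse'])

    items = []
    for hit in product_response['hits']['hit']:
        fields = hit['fields']  # each hit wraps its data in a 'fields' dict
        item = DmozItem()
        item['productname'] = fields.get('bname')
        items.append(item)
    return items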