Callback function not working properly in Scrapy

Asked: 2014-06-22 05:18:07

Tags: python web-scraping scrapy

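Hi, I'm new to Scrapy and I'm trying to scrape an ASP.NET website. I have identified the parameters that are sent when the form is posted and I use them in my code. However, although data is scraped from the first page, nothing is scraped after that, even though the spider reports that the other pages were crawled successfully. I'm trying to work out why it doesn't work :S. 'clean_parsed_string' and 'get_parsed_string' are my own functions for extracting string elements, and they have been tested on other websites.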

def parse(self, response):
    sel = Selector(response)
    snodes = sel.xpath('//div[@id="hotel_result_hotel_item"]')

    for snode in snodes:
        hotel_item = Hotel_Items()
        hotel_item['name'] = clean_parsed_string(get_parsed_string(snode, 'div[@class=""]/table[@class="widthfull"]//a[@class="hot_name"]/text()'))
        hotel_item['address'] = clean_parsed_string(get_parsed_string(snode, 'div[@class=""]/table[@class="widthfull"]//span[@class="fontsmalli"]/text()'))
        hotel_item['stars'] = clean_parsed_string(get_parsed_string(snode, 'div[@class=""]/table[@class="widthfull"]//div[@class="mbluebold col_hotelinfo_name"]/input/@class'))
        hotel_item['room1'] = clean_parsed_string(get_parsed_string(snode,'div[@class=""]/div[@class="showroom_rates"]/table[@class="widthfull text_left"]/tr[1]/td[1]/p[@class="roomtype"]/span/text()'))
        hotel_item['room1_price_USD'] = clean_parsed_string(get_parsed_string(snode,'div[@class=""]/div[@class="showroom_rates"]/table[@class="widthfull text_left"]/tr[1]/td[5]/p[@class="ratepernight"]/span/text()'))
        hotel_item['room2'] = clean_parsed_string(get_parsed_string(snode,'div[@class=""]/div[@class="showroom_rates"]/table[@class="widthfull text_left"]/tr[2]/td[1]/p[@class="roomtype"]/span/text()'))
        hotel_item['room2_price_USD'] = clean_parsed_string(get_parsed_string(snode,'div[@class=""]/div[@class="showroom_rates"]/table[@class="widthfull text_left"]/tr[2]/td[5]/p[@class="ratepernight"]/span/text()'))
        hotel_item['room3'] = clean_parsed_string(get_parsed_string(snode,'div[@class=""]/div[@class="showroom_rates"]/table[@class="widthfull text_left"]/tr[3]/td[1]/p[@class="roomtype"]/span/text()'))
        hotel_item['room3_price_USD'] = clean_parsed_string(get_parsed_string(snode,'div[@class=""]/div[@class="showroom_rates"]/table[@class="widthfull text_left"]/tr[3]/td[5]/p[@class="ratepernight"]/span/text()'))
        hotel_item['room4'] = clean_parsed_string(get_parsed_string(snode,'div[@class=""]/div[@class="showroom_rates"]/table[@class="widthfull text_left"]/tr[4]/td[1]/p[@class="roomtype"]/span/text()'))
        hotel_item['room4_price_USD'] = clean_parsed_string(get_parsed_string(snode,'div[@class=""]/div[@class="showroom_rates"]/table[@class="widthfull text_left"]/tr[4]/td[5]/p[@class="ratepernight"]/span/text()'))
        yield hotel_item


    viewstate = sel.xpath('//input[@name="__VIEWSTATE"]/@value').extract()[0]
    yield FormRequest.from_response(response,formdata={'ctl00$scriptmanager1':'ctl00$ContentMain$upResultFooter|ctl00$ContentMain$lbtnFooterNext',
                'ctl00_scriptmanager1_HiddenField':'',
                '__EVENTTARGET':'ctl00$ContentMain$lbtnFooterNext',
                '__EVENTARGUMENT':'',
                '__LASTFOCUS':'',
                '__VIEWSTATE': viewstate,
                '__SCROLLPOSITIONX':'0',
                '__SCROLLPOSITIONY':'0',
                'ctl00$Googlesearch$txtSearch':'',
                'ctl00$ddlCurrency$hidCurrencyChange':'USD',
                'ctl00$ContentMain$hdfMinPrice':'',
                'ctl00$ContentMain$hdfMaxPrice':'',
                'ctl00$ContentMain$ddlSort':'1',    
                'ctl00$ContentMain$hidMenu':'0',
                'ctl00$ContentMain$hidSubMenu':'',
                'ctl00$ContentMain$DestinationSearchBox1$arrivaldate':'06/23/2014',
                'ctl00$ContentMain$DestinationSearchBox1$departdate':'06/25/2014',
                'ctl00$ContentMain$DestinationSearchBox1$controlmode':'1',
                'ctl00$ContentMain$DestinationSearchBox1$jsRooms':'0',  
                'ctl00$ContentMain$DestinationSearchBox1$jsAdults':'0',
                'ctl00$ContentMain$DestinationSearchBox1$jsChildren':'0',
                'ctl00$ContentMain$DestinationSearchBox1$SearchHotel':'no',
                'ctl00$ContentMain$DestinationSearchBox1$ErrorCharLengthMessage':'Please enter at least the first two letters of the name you are looking for.',
                'ctl00$ContentMain$DestinationSearchBox1$TextError':'Please enter the name of a Country, City, Airport, Area, Landmark or Hotel to proceed.',
                'ctl00$ContentMain$DestinationSearchBox1$TextSearch1$tmptextDefault':'Country, City, Airport, Area, Landmark',
                'ctl00$ContentMain$DestinationSearchBox1$TextSearch1$txtSearch':'Colombo',
                'ctl00$ContentMain$DestinationSearchBox1$ddlDistance':'1',
                'ddlCheckInDay':'23',
                'ddlCheckInMonthYear':'6,2014',
                'datepickerarrival':'',
                'ddlCheckOutDay':'25',
                'ddlCheckOutMonthYear':'6,2014',
                'ctl00$ContentMain$DestinationSearchBox1$ddlNights':'2',
                'datepickerdepart':'',
                'ctl00$ContentMain$DestinationSearchBox1$ddlRoom':'1',
                'ctl00$ContentMain$DestinationSearchBox1$ddlAdult':'2',
                'ctl00$ContentMain$DestinationSearchBox1$ddlChildren':'0',
                'ctl00$ContentMain$txtHotelName':'',
                'ctl00$ContentMain$hidHotelList2603':'',
                'ctl00$ContentMain$HotelFilterStarRating$HiddenFilterStatus':'',
                'ctl00$ContentMain$HotelFilterFacilities$HiddenFilterStatus':'',
                'ctl00$ContentMain$HotelFilterAccommodationType$HiddenFilterStatus':'',
                'ctl00$ContentMain$HotelFilterArea$HiddenFilterStatus':'',
                'ctl00$ContentMain$HotelFilterChainAndBrand$HiddenFilterStatus':'',
                #'__ASYNCPOST':'true'
                },
            callback=self.parse,clickdata=None)

1 answer:

Answer 0 (score: 0)

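The site may return a 200 OK status even if the form fields in your POST are wrong. Try using scrapy shell to submit a FormRequest with the form data you built, and see what the site actually returns.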
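
One way to narrow this down, as a rough sketch rather than a definitive method: copy the POST body that the browser sends when you click "next" (from the browser's network tools) and compare it field by field with the body of your FormRequest (`request.body`, decoded to text). The helper name `diff_form_bodies` and the two sample bodies below are made up for illustration; only the standard library is used.

```python
from urllib.parse import parse_qsl


def diff_form_bodies(browser_body, spider_body):
    """Compare two URL-encoded POST bodies field by field.

    Hypothetical debugging helper: it does not talk to the site, it only
    reports which fields are missing, extra, or different.
    """
    # keep_blank_values=True so that empty fields such as __EVENTARGUMENT
    # are not silently dropped from the comparison.
    browser = dict(parse_qsl(browser_body, keep_blank_values=True))
    spider = dict(parse_qsl(spider_body, keep_blank_values=True))
    shared = set(browser) & set(spider)
    return {
        "missing": sorted(set(browser) - set(spider)),
        "extra": sorted(set(spider) - set(browser)),
        "different": sorted(n for n in shared if browser[n] != spider[n]),
    }


# Made-up sample bodies, standing in for what the browser and the spider send.
browser_body = (
    "__EVENTTARGET=ctl00%24ContentMain%24lbtnFooterNext"
    "&__VIEWSTATE=abc&__ASYNCPOST=true"
)
spider_body = (
    "__EVENTTARGET=ctl00%24ContentMain%24lbtnFooterNext"
    "&__VIEWSTATE=xyz&ddlCheckInDay=23"
)

report = diff_form_bodies(browser_body, spider_body)
for kind, names in report.items():
    print(kind, names)
```

A field that shows up under "missing" or "different" is a reasonable first suspect when the site answers 200 OK but keeps serving the same page.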

I'd suggest using something like this, so that you don't have to type every field by hand and can avoid possible mistakes:

formdata = {}

for hid in sel.xpath('//input[@type="hidden" and @value and @name]'):
    formdata[hid.xpath('@name').extract()[0]] = hid.xpath('@value').extract()[0]