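Hi, I'm new to Scrapy and I'm trying to scrape an ASP.NET website. I've identified the parameters that are sent when the form is posted and I'm using them in my code. However, although data is scraped from the first page, nothing is scraped after that, even though the spider reports that the other pages were crawled successfully. I'm trying to figure out why it isn't working :S. 'clean_parsed_string' and 'get_parsed_string' are my own functions for extracting string elements and have been tested on other websites.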
from scrapy.http import FormRequest
from scrapy.selector import Selector


def parse(self, response):
    sel = Selector(response)
    snodes = sel.xpath('//div[@id="hotel_result_hotel_item"]')
    # Emit one item per hotel block found on the current results page.
    for snode in snodes:
        hotel_item = Hotel_Items()
        hotel_item['name'] = clean_parsed_string(get_parsed_string(snode, 'div[@class=""]/table[@class="widthfull"]//a[@class="hot_name"]/text()'))
        hotel_item['address'] = clean_parsed_string(get_parsed_string(snode, 'div[@class=""]/table[@class="widthfull"]//span[@class="fontsmalli"]/text()'))
        hotel_item['stars'] = clean_parsed_string(get_parsed_string(snode, 'div[@class=""]/table[@class="widthfull"]//div[@class="mbluebold col_hotelinfo_name"]/input/@class'))
        hotel_item['room1'] = clean_parsed_string(get_parsed_string(snode, 'div[@class=""]/div[@class="showroom_rates"]/table[@class="widthfull text_left"]/tr[1]/td[1]/p[@class="roomtype"]/span/text()'))
        hotel_item['room1_price_USD'] = clean_parsed_string(get_parsed_string(snode, 'div[@class=""]/div[@class="showroom_rates"]/table[@class="widthfull text_left"]/tr[1]/td[5]/p[@class="ratepernight"]/span/text()'))
        hotel_item['room2'] = clean_parsed_string(get_parsed_string(snode, 'div[@class=""]/div[@class="showroom_rates"]/table[@class="widthfull text_left"]/tr[2]/td[1]/p[@class="roomtype"]/span/text()'))
        hotel_item['room2_price_USD'] = clean_parsed_string(get_parsed_string(snode, 'div[@class=""]/div[@class="showroom_rates"]/table[@class="widthfull text_left"]/tr[2]/td[5]/p[@class="ratepernight"]/span/text()'))
        hotel_item['room3'] = clean_parsed_string(get_parsed_string(snode, 'div[@class=""]/div[@class="showroom_rates"]/table[@class="widthfull text_left"]/tr[3]/td[1]/p[@class="roomtype"]/span/text()'))
        hotel_item['room3_price_USD'] = clean_parsed_string(get_parsed_string(snode, 'div[@class=""]/div[@class="showroom_rates"]/table[@class="widthfull text_left"]/tr[3]/td[5]/p[@class="ratepernight"]/span/text()'))
        hotel_item['room4'] = clean_parsed_string(get_parsed_string(snode, 'div[@class=""]/div[@class="showroom_rates"]/table[@class="widthfull text_left"]/tr[4]/td[1]/p[@class="roomtype"]/span/text()'))
        hotel_item['room4_price_USD'] = clean_parsed_string(get_parsed_string(snode, 'div[@class=""]/div[@class="showroom_rates"]/table[@class="widthfull text_left"]/tr[4]/td[5]/p[@class="ratepernight"]/span/text()'))
        yield hotel_item

    # Re-post the form with the current __VIEWSTATE to request the next page.
    viewstate = sel.xpath('//input[@name="__VIEWSTATE"]/@value').extract()[0]
    yield FormRequest.from_response(
        response,
        formdata={
            'ctl00$scriptmanager1':'ctl00$ContentMain$upResultFooter|ctl00$ContentMain$lbtnFooterNext',
            'ctl00_scriptmanager1_HiddenField':'',
            '__EVENTTARGET':'ctl00$ContentMain$lbtnFooterNext',
            '__EVENTARGUMENT':'',
            '__LASTFOCUS':'',
            '__VIEWSTATE': viewstate,
            '__SCROLLPOSITIONX':'0',
            '__SCROLLPOSITIONY':'0',
            'ctl00$Googlesearch$txtSearch':'',
            'ctl00$ddlCurrency$hidCurrencyChange':'USD',
            'ctl00$ContentMain$hdfMinPrice':'',
            'ctl00$ContentMain$hdfMaxPrice':'',
            'ctl00$ContentMain$ddlSort':'1',
            'ctl00$ContentMain$hidMenu':'0',
            'ctl00$ContentMain$hidSubMenu':'',
            'ctl00$ContentMain$DestinationSearchBox1$arrivaldate':'06/23/2014',
            'ctl00$ContentMain$DestinationSearchBox1$departdate':'06/25/2014',
            'ctl00$ContentMain$DestinationSearchBox1$controlmode':'1',
            'ctl00$ContentMain$DestinationSearchBox1$jsRooms':'0',
            'ctl00$ContentMain$DestinationSearchBox1$jsAdults':'0',
            'ctl00$ContentMain$DestinationSearchBox1$jsChildren':'0',
            'ctl00$ContentMain$DestinationSearchBox1$SearchHotel':'no',
            'ctl00$ContentMain$DestinationSearchBox1$ErrorCharLengthMessage':'Please enter at least the first two letters of the name you are looking for.',
            'ctl00$ContentMain$DestinationSearchBox1$TextError':'Please enter the name of a Country, City, Airport, Area, Landmark or Hotel to proceed.',
            'ctl00$ContentMain$DestinationSearchBox1$TextSearch1$tmptextDefault':'Country, City, Airport, Area, Landmark',
            'ctl00$ContentMain$DestinationSearchBox1$TextSearch1$txtSearch':'Colombo',
            'ctl00$ContentMain$DestinationSearchBox1$ddlDistance':'1',
            'ddlCheckInDay':'23',
            'ddlCheckInMonthYear':'6,2014',
            'datepickerarrival':'',
            'ddlCheckOutDay':'25',
            'ddlCheckOutMonthYear':'6,2014',
            'ctl00$ContentMain$DestinationSearchBox1$ddlNights':'2',
            'datepickerdepart':'',
            'ctl00$ContentMain$DestinationSearchBox1$ddlRoom':'1',
            'ctl00$ContentMain$DestinationSearchBox1$ddlAdult':'2',
            'ctl00$ContentMain$DestinationSearchBox1$ddlChildren':'0',
            'ctl00$ContentMain$txtHotelName':'',
            'ctl00$ContentMain$hidHotelList2603':'',
            'ctl00$ContentMain$HotelFilterStarRating$HiddenFilterStatus':'',
            'ctl00$ContentMain$HotelFilterFacilities$HiddenFilterStatus':'',
            'ctl00$ContentMain$HotelFilterAccommodationType$HiddenFilterStatus':'',
            'ctl00$ContentMain$HotelFilterArea$HiddenFilterStatus':'',
            'ctl00$ContentMain$HotelFilterChainAndBrand$HiddenFilterStatus':'',
            #'__ASYNCPOST':'true'
        },
        callback=self.parse,
        clickdata=None,
    )
Answer 0 (score: 0)
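The website may return a 200 OK status even if your POST headers are wrong, so a successful crawl log does not prove that the next page was actually served. Try using scrapy shell and submit a FormRequest with the form data you created to see what the website returns.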
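As a rough aid for that check, here is a minimal sketch of a helper that summarizes a response body. The function name `summarize_body` and the two sample bodies are hypothetical; the marker strings `__VIEWSTATE` and `hotel_result_hotel_item` are taken from the question's code. It only does substring checks, so treat it as a quick sanity test rather than a parser.

```python
def summarize_body(body):
    """Return quick facts about a response body given as a str of HTML."""
    return {
        # A very short body often means an error page or a redirect stub.
        "length": len(body),
        # The next POST needs a fresh __VIEWSTATE from this page.
        "has_viewstate": 'name="__VIEWSTATE"' in body,
        # Number of hotel blocks the spider's XPath would have to match.
        "result_blocks": body.count('id="hotel_result_hotel_item"'),
    }


# Hypothetical bodies standing in for page 1 and for the POST response.
page_one = (
    '<input name="__VIEWSTATE" value="abc">'
    '<div id="hotel_result_hotel_item"></div>'
)
post_response = '<html><body>Session expired</body></html>'

print(summarize_body(page_one))
print(summarize_body(post_response))
```

Inside scrapy shell, one could pass the request to `fetch(...)` and then call `summarize_body(response.text)` (assuming a Scrapy version where `response.text` is available). If `result_blocks` is 0 for the POST response, the selectors in `parse` have nothing to match, which would explain pages being crawled without any items.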
I would suggest using something like this, so that you don't have to type out every form field by hand and risk making a mistake:
formdata = {}
for hid in sel.xpath('//input[@type="hidden" and @value and @name]'):
    formdata[hid.xpath('@name').extract()[0]] = hid.xpath('@value').extract()[0]
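The same idea can be tried outside a spider. The sketch below is a stand-in that uses Python's standard-library `html.parser` instead of Scrapy selectors, so it runs without Scrapy installed; it is not the code from the answer. The names `HiddenInputCollector` and `build_formdata` and the sample HTML are hypothetical. It collects the hidden inputs and then overrides only the field that triggers paging, using the `__EVENTTARGET` value from the question.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        name = attrs.get("name")
        value = attrs.get("value")
        # Same filter as the XPath: hidden, with both name and value present.
        if attrs.get("type") == "hidden" and name is not None and value is not None:
            self.fields[name] = value


def build_formdata(html, overrides):
    """Return the hidden fields found in html, updated with overrides."""
    collector = HiddenInputCollector()
    collector.feed(html)
    collector.close()
    formdata = dict(collector.fields)
    formdata.update(overrides)
    return formdata


# Hypothetical page fragment; a real ASP.NET page has many more fields.
sample_html = """
<form>
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />
  <input type="hidden" name="__EVENTTARGET" value="" />
  <input type="text" name="ctl00$ContentMain$txtHotelName" value="" />
</form>
"""

formdata = build_formdata(
    sample_html,
    {"__EVENTTARGET": "ctl00$ContentMain$lbtnFooterNext"},
)
print(formdata)
```

In a spider, a dict built this way could be passed as `formdata=` to `FormRequest.from_response`. Because the hidden values are read from each response, `__VIEWSTATE` stays in step with the page that was just received.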