I'm using Scrapy to crawl the office listing from http://www.johnlscott.com/agent-search.aspx.
The office listing addresses look like this: http://www.johnlscott.com/agent-search.aspx?p=agentResults.asp&OfficeID=8627 - but Scrapy fetches http://www.johnlscott.com/agent-search.aspx?OfficeID=8627&p=agentResults.asp, which is a dead page. The two query parameters after .aspx get swapped.
It still happens even when I explicitly load every address by hand into start_urls.
I'm using the latest Scrapy with Python 2.7 on Windows 8.1.
Code sample:
import csv

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
# JLSItem is the spider's Item subclass, defined in the project's items.py


class JLSSpider(CrawlSpider):
    name = 'JLS'
    allowed_domains = ['johnlscott.com']
    # start_urls = ['http://www.johnlscott.com/agent-search.aspx']

    rules = (
        Rule(callback="parse_start_url", follow=True),
    )

    def start_requests(self):
        # I have a csv of the office IDs (just letting it crawl through them creates the same issue)
        with open('hrefnums.csv', 'rbU') as ifile:
            read = csv.reader(ifile)
            for row in read:
                for col in row:
                    yield self.make_requests_from_url(
                        "http://www.johnlscott.com/agent-search.aspx?p=agentResults.asp&OfficeID=%s" % col)

    def parse_start_url(self, response):
        items = []
        sel = Selector(response)
        sections = sel.xpath("//tr/td/table[@id='tbAgents']/tr")
        for section in sections:
            item = JLSItem()
            item['name'] = section.xpath("td[2]/text()")[0].extract().replace(u'\xa0', ' ').strip()
            items.append(item)
        return items
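A quick way to confirm what Scrapy actually sends is to log both the scheduled URL and the one that answers, inside the callback. A minimal debugging sketch (the logging body below is my addition, not part of the original spider):

    def parse_start_url(self, response):
        # Debugging sketch: response.request.url is the URL that was actually
        # requested; response.url is the URL that answered (after any redirects).
        self.log("requested: %s" % response.request.url)
        self.log("answered:  %s" % response.url)
        return []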
Answer 0 (score: 0)
Crawling like this works without any problem:
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request


class JLSSpider(CrawlSpider):
    name = 'JLS'
    allowed_domains = ['johnlscott.com']

    def start_requests(self):
        yield Request("http://www.johnlscott.com/agent-search.aspx?p=agentResults.asp&OfficeID=8627",
                      callback=self.parse_item)

    def parse_item(self, response):
        print response.body
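If the office IDs should still come from the CSV file used in the question, the same Request-based approach can be combined with it. A sketch under that assumption (hrefnums.csv holding one office ID per cell, as in the original code):

import csv

from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request


class JLSSpider(CrawlSpider):
    name = 'JLS'
    allowed_domains = ['johnlscott.com']

    def start_requests(self):
        # Build each office URL by hand and yield a Request directly,
        # instead of going through make_requests_from_url.
        with open('hrefnums.csv', 'rbU') as ifile:
            for row in csv.reader(ifile):
                for office_id in row:
                    url = ("http://www.johnlscott.com/agent-search.aspx"
                           "?p=agentResults.asp&OfficeID=%s" % office_id)
                    yield Request(url, callback=self.parse_item)

    def parse_item(self, response):
        print response.body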