[scrapy & python]
I want to follow and extract all the links located at the XPath //div[@class="work_area_content"]/a, and keep traversing with that same XPath through every link down to the deepest level. I tried the code below, but it only goes through the top layer and does not follow each link.
I suspect it has something to do with the links variable holding an empty list, though I don't know why the list ends up empty.
class DatabloggerSpider(CrawlSpider):
    # The name of the spider
    name = "jobs"

    # The domains that are allowed (links to other domains are skipped)
    allowed_domains = ['1.1.1.1']

    # The URLs to start with
    start_urls = ['1.1.1.1/TestSuites']

    # Method for parsing items
    def parse(self, response):
        # The list of items that are found on the particular page
        items = []
        # Only extract canonicalized and unique links (with respect to the current page)
        test_str = response.text
        # Removes string between two placeholders with regex
        regex = r"(Back to)(.|\n)*?<br><br>"
        regex_response = re.sub(regex, "", test_str)
        regex_response2 = HtmlResponse(regex_response)  ##TODO: fix here!
        #print(regex_response2)
        links = LinkExtractor(canonicalize=True, unique=True, restrict_xpaths=('//div[@class="work_area_content"]/a')).extract_links(regex_response2)
        print(type(links))
        # Now go through all the found links
        print(links)
        for link in links:
            item = DatabloggerScraperItem()
            item['url_from'] = response.url
            item['url_to'] = link.url
            items.append(item)
        print(items)
        yield scrapy.Request(links, callback=self.parse, dont_filter=True)
        # Return all the found items
        return items
Answer 0 (score: 1)
I think you should use SgmlLinkExtractor with the follow=True parameter set.
Something like:
links = SgmlLinkExtractor(follow=True, restrict_xpaths=('//div[@class="work_area_content"]/a')).extract_links(regex_response2)
Since you are using a CrawlSpider, you should define rules; take a look at this blog post here for a complete example.
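For reference, a minimal sketch of that rule-based setup, assuming the start URL and XPath from the question (with an http:// scheme added) and a hypothetical parse_item callback. It uses the plain LinkExtractor, which replaces SgmlLinkExtractor in recent Scrapy versions, and follow=True goes on the Rule rather than on the link extractor:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DatabloggerSpider(CrawlSpider):
    name = "jobs"
    allowed_domains = ['1.1.1.1']
    start_urls = ['http://1.1.1.1/TestSuites']

    # A single rule: extract links under the question's XPath, send each
    # matched page to parse_item, and keep following such links recursively.
    rules = (
        Rule(
            LinkExtractor(
                canonicalize=True,
                unique=True,
                restrict_xpaths='//div[@class="work_area_content"]/a',
            ),
            callback='parse_item',
            follow=True,
        ),
    )

    def parse_item(self, response):
        # Hypothetical callback: just record which page was reached.
        yield {'url_to': response.url}

Note that a CrawlSpider's rules only fire when the built-in parse method is left untouched, which is why the callback here is named parse_item instead of parse.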
Answer 1 (score: -1)
Here is an example of a spider that follows and extracts all the links on a website; it can handle pagination as well:
def parse(self, response):
    for href in response.css('h2.seotitle > a::attr(href)'):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url=url, callback=self.parse_details)

    next_page_url = response.css('ul.pager').xpath('//a[contains(text(), "Next")]/@althref').extract_first()
    print(next_page_url)
    if next_page_url:
        nextpage = response.css('ul.pager').xpath('//a[contains(text(), "Next")]/@onclick').extract_first()
        searchresult_num = nextpage.split("'")[1].strip()
        next_page_url = "http://jobsearch.monsterindia.com/searchresult.html?day=1&n=" + searchresult_num
        next_page_url = response.urljoin(next_page_url)
        print(next_page_url)
        yield scrapy.Request(url=next_page_url, callback=self.parse)

def parse_details(self, response):
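Purely as an illustration, a hypothetical parse_details body might look like this (the field names and the CSS selector are assumptions, not taken from the post):

def parse_details(self, response):
    # Hypothetical: pull a couple of example fields from the detail page.
    yield {
        'url': response.url,
        'title': response.css('h1::text').extract_first(),
    }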