I am trying to scrape a website using Scrapy. My spider is as follows:
# imports for the (older) Scrapy version that still ships SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class mySpider(CrawlSpider):
    name = "mytest"
    allowed_domains = ['www.example.com']
    start_urls = ['http://www.example.com']

    rules = [
        Rule(SgmlLinkExtractor(allow=[r'\d{4}/\d{2}/\w+']), callback='parse_post',
             follow=True)
    ]

    def parse_post(self, response):
        # PostItem is assumed to be defined elsewhere (e.g. in items.py)
        item = PostItem()
        item['url'] = response.url
        item['title'] = response.xpath('//title/text()').extract()
        item['authors'] = response.xpath('//span[@class="author"]/text()').extract()
        return item
Everything works, but it only scrapes the links on the main page. The site loads more articles through a POST request triggered by a "Load More Articles" button. Is there any way I can simulate clicking that button so the spider loads the extra articles and keeps crawling?
Answer 0 (score: 2)
The "Load More Articles" button is managed by JavaScript; clicking it fires an AJAX POST request. In other words, this is not something Scrapy can handle easily on its own. However, if Scrapy is not a hard requirement, here is a solution using requests and BeautifulSoup:
from bs4 import BeautifulSoup
import requests

url = "http://www.ijreview.com/wp-admin/admin-ajax.php"

session = requests.Session()

page_size = 24
params = {
    'action': 'load_more',
    'numPosts': page_size,
    'category': '',
    'orderby': 'date',
    'time': ''
}

offset = 0
limit = 100
while offset < limit:
    # ask the AJAX endpoint for the next chunk of posts
    params['offset'] = offset
    response = session.post(url, data=params)

    # each chunk is a list of <li> items containing links to posts
    links = [a['href'] for a in BeautifulSoup(response.content).select('li > a')]
    for link in links:
        response = session.get(link)
        page = BeautifulSoup(response.content)

        title = page.find('title').text.strip()
        author = page.find('span', class_='author').text.strip()
        print {'link': link, 'title': title, 'author': author}

    offset += page_size
This prints:
{'author': u'Kevin Boyd', 'link': 'http://www.ijreview.com/2014/08/172770-president-obama-realizes-world-messy-place-thanks-social-media/', 'title': u'President Obama Calls The World A Messy Place & Blames Social Media for Making People Take Notice'}
{'author': u'Reid Mene', 'link': 'http://www.ijreview.com/2014/08/172405-17-politicians-weird-jobs-time-office/', 'title': u'12 Most Unusual Professions of Politicians Before They Were Elected to Higher Office'}
{'author': u'Michael Hausam', 'link': 'http://www.ijreview.com/2014/08/172653-video-duty-mp-fakes-surrender-shoots-hostage-taker/', 'title': u'Video: Off-Duty MP Fake Surrenders at Gas Station Before Revealing Deadly Surprise for Hostage Taker'}
...
You may need to tweak the code so that it supports different categories, sort orders, and so on. You can also improve the HTML parsing speed by letting BeautifulSoup use the lxml parser: instead of BeautifulSoup(response.content), use BeautifulSoup(response.content, "lxml"), though you will need to have lxml installed.
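For instance, both tweaks could look like this (a minimal sketch; the non-default category and orderby values the endpoint accepts are an assumption, not something verified against the site):

from bs4 import BeautifulSoup

def parse_html(content):
    # the lxml backend is considerably faster than the default parser,
    # but requires `pip install lxml`
    return BeautifulSoup(content, "lxml")

def build_params(page_size, offset, category='', orderby='date'):
    # payload for the load_more AJAX endpoint; non-default category/orderby
    # values are an untested assumption about what the endpoint accepts
    return {
        'action': 'load_more',
        'numPosts': page_size,
        'offset': offset,
        'category': category,
        'orderby': orderby,
        'time': ''
    }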
And here is how you could adapt the solution to Scrapy:
import urllib

from scrapy import Item, Field, Request, Spider


class PostItem(Item):
    url = Field()
    title = Field()
    authors = Field()


class mySpider(Spider):
    name = "mytest"
    allowed_domains = ['www.ijreview.com']

    def start_requests(self):
        page_size = 25
        headers = {'User-Agent': 'Scrapy spider',
                   'X-Requested-With': 'XMLHttpRequest',
                   'Host': 'www.ijreview.com',
                   'Origin': 'http://www.ijreview.com',
                   'Accept': '*/*',
                   'Referer': 'http://www.ijreview.com/',
                   'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'}

        # replay the AJAX POST for offsets 0, 25, 50, ... up to 200
        for offset in xrange(0, 200, page_size):
            yield Request('http://www.ijreview.com/wp-admin/admin-ajax.php',
                          method='POST',
                          headers=headers,
                          body=urllib.urlencode(
                              {'action': 'load_more',
                               'numPosts': page_size,
                               'offset': offset,
                               'category': '',
                               'orderby': 'date',
                               'time': ''}))

    def parse(self, response):
        # each AJAX response chunk is a <ul> of posts; follow every link
        for link in response.xpath('//ul/li/a/@href').extract():
            yield Request(link, callback=self.parse_post)

    def parse_post(self, response):
        item = PostItem()
        item['url'] = response.url
        item['title'] = response.xpath('//title/text()').extract()[0].strip()
        item['authors'] = response.xpath('//span[@class="author"]/text()').extract()[0].strip()
        return item
Output:
{'authors': u'Kyle Becker',
'title': u'17 Reactions to the \u2018We Don\u2019t Have a Strategy\u2019 Gaffe That May Haunt the Rest of Obama\u2019s Presidency',
'url': 'http://www.ijreview.com/2014/08/172569-25-reactions-obamas-dont-strategy-gaffe-may-haunt-rest-presidency/'}
...
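As a side note, instead of urlencoding the POST body by hand, Scrapy's FormRequest can build it for you. Here is a minimal sketch of a drop-in replacement for the start_requests method above (same behavior in principle, but untested against the site):

from scrapy import Spider
from scrapy.http import FormRequest

class mySpider(Spider):
    name = "mytest"
    allowed_domains = ['www.ijreview.com']

    def start_requests(self):
        page_size = 25
        # FormRequest defaults to POST, urlencodes `formdata` and sets the
        # form Content-Type header itself, so urllib.urlencode is not needed;
        # formdata values must be strings, hence the str() calls
        for offset in xrange(0, 200, page_size):
            yield FormRequest('http://www.ijreview.com/wp-admin/admin-ajax.php',
                              formdata={'action': 'load_more',
                                        'numPosts': str(page_size),
                                        'offset': str(offset),
                                        'category': '',
                                        'orderby': 'date',
                                        'time': ''},
                              headers={'X-Requested-With': 'XMLHttpRequest'})

    # parse and parse_post stay the same as in the spider above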