I have a spider (shown below) that works well for regular HTML page scraping. However, I want to add an extra feature: I want to parse a JSON page.
Here is what I want to do (done manually here, without Scrapy):
import requests, json
import datetime

def main():
    user_agent = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'
    }
    # This is the URL that outputs JSON:
    externalj = 'http://www.thestudentroom.co.uk/externaljson.php?&s='
    # Form the end of the URL, it is based on the time (unixtime):
    past = datetime.datetime.now() - datetime.timedelta(minutes=15)
    time = past.strftime('%s')
    # This is the full URL:
    url = externalj + time
    # Make the HTTP GET request:
    tsr_data = requests.get(url, headers=user_agent).json()
    # Iterate over the JSON data and form the URLs
    # (there are no URLs at all in the JSON data, they must be formed manually):
    # URL is formed simply by concatenating the canonical link with a thread-id:
    for post in tsr_data['discussions-recent']:
        link = 'www.thestudentroom.co.uk/showthread.php?t='
        return link + str(post['threadid'])
This function returns the correct link to the HTML page (a forum-thread link) that I want to scrape. It seems I need to create my own Request object to send to parse_link in the spider?
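Something like this is what I imagine, perhaps (just a guess on my part; json_to_requests is a hypothetical helper name, and parse_link is the callback in my spider below):

from scrapy.http import Request

def json_to_requests(self, tsr_data):
    # Hypothetical sketch: wrap each manually-built URL in a
    # Request so Scrapy downloads it and hands the response
    # to parse_link.
    for post in tsr_data['discussions-recent']:
        link = 'http://www.thestudentroom.co.uk/showthread.php?t=' + str(post['threadid'])
        yield Request(url=link, callback=self.parse_link)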
My question is: where do I put this code? I'm confused about how to fit it into Scrapy. Do I need to create another spider?
Ideally, I would like it to work with the spider that I already have, but I'm not sure whether that's possible.
I'm quite confused about how to implement this in Scrapy, and I hope someone can advise!
My current spider looks like this:
import scrapy
from tutorial.items import TsrItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class TsrSpider(CrawlSpider):
    name = 'tsr'
    allowed_domains = ['thestudentroom.co.uk']
    start_urls = ['http://www.thestudentroom.co.uk/forumdisplay.php?f=89']
    download_delay = 2
    user_agent = 'youruseragenthere'

    thread_xpaths = ("//tr[@class='thread unread ']",
                     "//*[@id='discussions-recent']/li/a",
                     "//*[@id='discussions-popular']/li/a")

    rules = [
        Rule(LinkExtractor(allow=('showthread\.php\?t=\d+',),
                           restrict_xpaths=thread_xpaths),
             callback='parse_link', follow=True),
    ]

    def parse_link(self, response):
        for sel in response.xpath("//li[@class='post threadpost old ']"):
            item = TsrItem()
            item['id'] = sel.xpath(
                "div[@class='post-header']//li[@class='post-number museo']/a/span/text()").extract()
            item['rating'] = sel.xpath(
                "div[@class='post-footer']//span[@class='score']/text()").extract()
            item['post'] = sel.xpath(
                "div[@class='post-content']/blockquote[@class='postcontent restore']/text()").extract()
            item['link'] = response.url
            item['topic'] = response.xpath(
                "//div[@class='forum-header section-header']/h1/span/text()").extract()
            yield item
Answer 0 (score: 1)
It seems I've found a way to make it work. Perhaps my original post wasn't clear: I want to parse a JSON response and then send a Request for Scrapy to process further.
I added the following to my spider:
# A request object is required.
from scrapy.http import Request
and
def parse_start_url(self, response):
    if 'externaljson.php' in str(response.url):
        return self.make_json_links(response)
parse_start_url seems to do what its name says: it parses the initial (start) URLs. Only the JSON page should be handled here.
So I need to add my special JSON URL alongside my HTML URL:
start_urls = ['http://tsr.com/externaljson.php', 'http://tsr.com/thread.html']
I now need to generate URLs, as Requests, from the JSON page's response:
def make_json_links(self, response):
    ''' Creates requests from JSON page. '''
    data = json.loads(response.body_as_unicode())
    for post in data['discussions-recent']:
        link = 'http://www.tsr.co.uk/showthread.php?t='
        full_link = link + str(post['threadid'])
        json_request = Request(url=full_link)
        return json_request
It seems to work now. However, I'm sure this is a hacky and inelegant way of achieving it; somehow it feels wrong.
It does appear to work, and it follows all the links I make from the JSON page. I'm also not sure whether I should use yield instead of return somewhere in there...
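For reference, a yield version would look something like this; as far as I can tell, return hands back only the first Request because it exits the loop on the first iteration, whereas yield would emit one Request per thread in the JSON:

def make_json_links(self, response):
    ''' Creates one request per thread in the JSON page. '''
    data = json.loads(response.body_as_unicode())
    for post in data['discussions-recent']:
        full_link = 'http://www.tsr.co.uk/showthread.php?t=' + str(post['threadid'])
        # yield keeps the loop running, so every thread id
        # becomes a scheduled Request:
        yield Request(url=full_link)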
Answer 1 (score: 0)
Do the links always follow the same format? Is it not possible to create a new Rule for the JSON links, with a separate parse_json function as the callback?
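An untested sketch of what I mean (the allow pattern is a guess, and this assumes the externaljson.php URL turns up as a link on a crawled HTML page, since LinkExtractor only extracts links from HTML):

rules = [
    # Existing rule for the HTML thread pages:
    Rule(LinkExtractor(allow=('showthread\.php\?t=\d+',),
                       restrict_xpaths=thread_xpaths),
         callback='parse_link', follow=True),
    # New rule for the JSON endpoint:
    Rule(LinkExtractor(allow=('externaljson\.php',)),
         callback='parse_json'),
]

def parse_json(self, response):
    # Build one Request per thread id, as in the question:
    data = json.loads(response.body_as_unicode())
    for post in data['discussions-recent']:
        url = 'http://www.thestudentroom.co.uk/showthread.php?t=' + str(post['threadid'])
        yield Request(url=url, callback=self.parse_link)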