Why can't scrapy-plugins / scrapy-jsonrpc get a spider's stats?

Posted: 2016-09-27 02:50:26

Tags: scrapy

I just want to monitor the stats of my running spiders. I installed the latest scrapy-plugins / scrapy-jsonrpc and configured the project as follows:

EXTENSIONS = {
    'scrapy_jsonrpc.webservice.WebService': 500,
}

JSONRPC_ENABLED = True

JSONRPC_PORT = [60853]

But when I browse http://localhost:60853/, it only returns:

{"resources": ["crawler"]}

I can only get the names of the running spiders, without any stats. Can anyone tell me where I went wrong? Thanks!

1 answer:

Answer 0 (score: 0):

http://localhost:60853/ returns the available resources, and /crawler is the only top-level resource there.

If you want a spider's stats, you need to query the /crawler/stats endpoint and call get_stats().
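The webservice speaks plain JSON-RPC 2.0 over HTTP POST, so you can also hit the endpoint directly. Here is a minimal sketch, assuming the requests package is installed, a webservice listening on port 6024 and a running "httpbin" spider (both matching the examples below); the payload is the same one shown in the wire dump further down:

import json
import requests

# JSON-RPC 2.0 request envelope; "params" holds the spider name.
# Use "params": [] to get the global stats instead.
payload = {"jsonrpc": "2.0", "method": "get_stats",
           "params": ["httpbin"], "id": 1}

r = requests.post("http://localhost:6024/crawler/stats",
                  data=json.dumps(payload))
# The stats dict is wrapped in the JSON-RPC envelope under "result".
print(r.json()["result"])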

Here's an example using python-jsonrpc (here I configured the webservice to listen on localhost, port 6024):

>>> import pyjsonrpc
>>> http_client = pyjsonrpc.HttpClient('http://localhost:6024/crawler/stats')

>>> http_client.call('get_stats', 'httpbin')
{u'log_count/DEBUG': 4, u'scheduler/dequeued': 4, u'log_count/INFO': 9, u'downloader/response_count': 2, u'downloader/response_status_count/200': 2, u'log_count/WARNING': 1, u'scheduler/enqueued/memory': 4, u'downloader/response_bytes': 639, u'start_time': u'2016-09-28 08:49:57', u'scheduler/dequeued/memory': 4, u'scheduler/enqueued': 4, u'downloader/request_bytes': 862, u'response_received_count': 2, u'downloader/request_method_count/GET': 4, u'downloader/request_count': 4}

>>> http_client.call('get_stats')
{u'log_count/DEBUG': 4, u'scheduler/dequeued': 4, u'log_count/INFO': 9, u'downloader/response_count': 2, u'downloader/response_status_count/200': 2, u'log_count/WARNING': 1, u'scheduler/enqueued/memory': 4, u'downloader/response_bytes': 639, u'start_time': u'2016-09-28 08:49:57', u'scheduler/dequeued/memory': 4, u'scheduler/enqueued': 4, u'downloader/request_bytes': 862, u'response_received_count': 2, u'downloader/request_method_count/GET': 4, u'downloader/request_count': 4}
>>> from pprint import pprint
>>> pprint(http_client.call('get_stats'))
{u'downloader/request_bytes': 862,
 u'downloader/request_count': 4,
 u'downloader/request_method_count/GET': 4,
 u'downloader/response_bytes': 639,
 u'downloader/response_count': 2,
 u'downloader/response_status_count/200': 2,
 u'log_count/DEBUG': 4,
 u'log_count/INFO': 9,
 u'log_count/WARNING': 1,
 u'response_received_count': 2,
 u'scheduler/dequeued': 4,
 u'scheduler/dequeued/memory': 4,
 u'scheduler/enqueued': 4,
 u'scheduler/enqueued/memory': 4,
 u'start_time': u'2016-09-28 08:49:57'}
>>> 

You can also use jsonrpc_client_call from scrapy_jsonrpc.jsonrpc:

>>> from scrapy_jsonrpc.jsonrpc import jsonrpc_client_call
>>> jsonrpc_client_call('http://localhost:6024/crawler/stats', 'get_stats', 'httpbin')
{u'log_count/DEBUG': 5, u'scheduler/dequeued': 4, u'log_count/INFO': 11, u'downloader/response_count': 3, u'downloader/response_status_count/200': 3, u'log_count/WARNING': 1, u'scheduler/enqueued/memory': 4, u'downloader/response_bytes': 870, u'start_time': u'2016-09-28 09:01:47', u'scheduler/dequeued/memory': 4, u'scheduler/enqueued': 4, u'downloader/request_bytes': 862, u'response_received_count': 3, u'downloader/request_method_count/GET': 4, u'downloader/request_count': 4}

Here's what "goes over the wire" for a request made with a modified example-client.py (see the code further below; the example client in https://github.com/scrapy-plugins/scrapy-jsonrpc was outdated at the time of writing):

POST /crawler/stats HTTP/1.1
Accept-Encoding: identity
Content-Length: 73
Host: localhost:6024
Content-Type: application/x-www-form-urlencoded
Connection: close
User-Agent: Python-urllib/2.7

{"params": ["httpbin"], "jsonrpc": "2.0", "method": "get_stats", "id": 1}

The response:

HTTP/1.1 200 OK
Content-Length: 504
Access-Control-Allow-Headers:  X-Requested-With
Server: TwistedWeb/16.4.1
Connection: close
Date: Tue, 27 Sep 2016 11:21:43 GMT
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET, POST, PATCH, PUT, DELETE
Content-Type: application/json

{"jsonrpc": "2.0", "result": {"log_count/DEBUG": 5, "scheduler/dequeued": 4, "log_count/INFO": 11, "downloader/response_count": 3, "downloader/response_status_count/200": 3, "log_count/WARNING": 3, "scheduler/enqueued/memory": 4, "downloader/response_bytes": 870, "start_time": "2016-09-27 11:16:25", "scheduler/dequeued/memory": 4, "scheduler/enqueued": 4, "downloader/request_bytes": 862, "response_received_count": 3, "downloader/request_method_count/GET": 4, "downloader/request_count": 4}, "id": 1}

Below is the modified client I used to query /crawler/stats, invoked as ./example-client.py -H localhost -P 6024 get-spider-stats httpbin (for a running "httpbin" spider; JSONRPC_PORT is 6024 in my case):

#!/usr/bin/env python
"""
Example script to control a Scrapy server using its JSON-RPC web service.

It only provides a reduced functionality as its main purpose is to illustrate
how to write a web service client. Feel free to improve or write your own.

Also, keep in mind that the JSON-RPC API is not stable. The recommended way for
controlling a Scrapy server is through the execution queue (see the "queue"
command).

"""

from __future__ import print_function
import sys, optparse, json
from six.moves.urllib.parse import urljoin
from six.moves.urllib.request import urlopen

from scrapy_jsonrpc.jsonrpc import jsonrpc_client_call, JsonRpcError

def get_commands():
    return {
        'help': cmd_help,
        'stop': cmd_stop,
        'list-available': cmd_list_available,
        'list-running': cmd_list_running,
        'list-resources': cmd_list_resources,
        'get-global-stats': cmd_get_global_stats,
        'get-spider-stats': cmd_get_spider_stats,
    }

def cmd_help(args, opts):
    """help - list available commands"""
    print("Available commands:")
    for _, func in sorted(get_commands().items()):
        print("  ", func.__doc__)

def cmd_stop(args, opts):
    """stop <spider> - stop a running spider"""
    jsonrpc_call(opts, 'crawler/engine', 'close_spider', args[0])

def cmd_list_running(args, opts):
    """list-running - list running spiders"""
    for x in json_get(opts, 'crawler/engine/open_spiders'):
        print(x)

def cmd_list_available(args, opts):
    """list-available - list name of available spiders"""
    for x in jsonrpc_call(opts, 'crawler/spiders', 'list'):
        print(x)

def cmd_list_resources(args, opts):
    """list-resources - list available web service resources"""
    for x in json_get(opts, '')['resources']:
        print(x)

def cmd_get_spider_stats(args, opts):
    """get-spider-stats <spider> - get stats of a running spider"""
    stats = jsonrpc_call(opts, 'crawler/stats', 'get_stats', args[0])
    for name, value in stats.items():
        print("%-40s %s" % (name, value))

def cmd_get_global_stats(args, opts):
    """get-global-stats - get global stats"""
    stats = jsonrpc_call(opts, 'crawler/stats', 'get_stats')
    for name, value in stats.items():
        print("%-40s %s" % (name, value))

def get_wsurl(opts, path):
    return urljoin("http://%s:%s/"% (opts.host, opts.port), path)

def jsonrpc_call(opts, path, method, *args, **kwargs):
    url = get_wsurl(opts, path)
    return jsonrpc_client_call(url, method, *args, **kwargs)

def json_get(opts, path):
    url = get_wsurl(opts, path)
    return json.loads(urlopen(url).read())

def parse_opts():
    usage = "%prog [options] <command> [arg] ..."
    description = "Scrapy web service control script. Use '%prog help' " \
        "to see the list of available commands."
    op = optparse.OptionParser(usage=usage, description=description)
    op.add_option("-H", dest="host", default="localhost", \
        help="Scrapy host to connect to")
    op.add_option("-P", dest="port", type="int", default=6080, \
        help="Scrapy port to connect to")
    opts, args = op.parse_args()
    if not args:
        op.print_help()
        sys.exit(2)
    cmdname, cmdargs = args[0], args[1:]
    commands = get_commands()
    if cmdname not in commands:
        sys.stderr.write("Unknown command: %s\n\n" % cmdname)
        cmd_help(None, None)
        sys.exit(1)
    return commands[cmdname], cmdargs, opts

def main():
    cmd, args, opts = parse_opts()
    try:
        cmd(args, opts)
    except IndexError:
        print(cmd.__doc__)
    except JsonRpcError as e:
        print(str(e))
        if e.data:
            print("Server Traceback below:")
            print(e.data)


if __name__ == '__main__':
    main()

With the example command above, I got this:

log_count/DEBUG                          5
scheduler/dequeued                       4
log_count/INFO                           11
downloader/response_count                3
downloader/response_status_count/200     3
log_count/WARNING                        3
scheduler/enqueued/memory                4
downloader/response_bytes                870
start_time                               2016-09-27 11:16:25
scheduler/dequeued/memory                4
scheduler/enqueued                       4
downloader/request_bytes                 862
response_received_count                  3
downloader/request_method_count/GET      4
downloader/request_count                 4
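
Since the original goal was to monitor a running spider, a minimal polling sketch built on jsonrpc_client_call could look like this (the spider name and port match the examples above; the 5-second interval is an arbitrary choice):

import time

from scrapy_jsonrpc.jsonrpc import jsonrpc_client_call

while True:
    # Re-query the running spider's stats every few seconds.
    stats = jsonrpc_client_call('http://localhost:6024/crawler/stats',
                                'get_stats', 'httpbin')
    print('%(downloader/request_count)s requests, '
          '%(downloader/response_count)s responses' % stats)
    time.sleep(5)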