Requests in a multiprocess Luigi task

Time: 2017-01-22 18:01:40

Tags: python multithreading elasticsearch python-requests luigi

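I have a simple Luigi Elasticsearch indexing task that uses Requests to perform a GET and pushes the response to a local Elasticsearch instance. I also wrote a second task that calls the first one several times, as shown here: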

import luigi
import requests
from luigi.contrib.esindex import CopyToIndex


class RequestTask(CopyToIndex):
    """Fetch a page and index its body into Elasticsearch."""
    TEST_URL = 'http://www.this-page-intentionally-left-blank.org'
    index = 'example_index'
    iteration = luigi.IntParameter()

    def docs(self):
        # One document per task: the page body plus the iteration number.
        res = requests.get(self.TEST_URL).content.decode('utf-8')
        return [{'response': res, 'iteration': self.iteration}]


class ManyRequests(luigi.Task):
    """Wrapper task that depends on four RequestTask instances."""
    def requires(self):
        return [RequestTask(iteration) for iteration in range(0, 4)]


if __name__ == '__main__':
    luigi.run()

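If I run the ManyRequests task with a single worker, it works fine. However, if I specify several workers (for example, --workers 4), the processes raise TransportError(index_already_exists_exception) from Elasticsearch and do not complete correctly. The number of processes that finish is random, so I think this is caused by some conflict when writing to the Elasticsearch database. Do I have to implement ManyRequests in a different way?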
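The suspected conflict can be illustrated with a minimal sketch. The traceback in the console output shows every worker reaching es.indices.create(index=self.index, ...) inside create_index; if several workers get there before the index exists, only one creation can succeed and the rest receive the 400 error. The code below is a simulation only: FakeIndices, IndexAlreadyExists and create_index_tolerant are hypothetical stand-ins (not part of Luigi or the Elasticsearch client), and threads stand in for Luigi's worker processes. It shows one possible pattern, treating "already exists" as success, which in the real task would presumably correspond to overriding create_index and handling that specific error.

```python
import threading


class IndexAlreadyExists(Exception):
    """Hypothetical stand-in for the 400 'index_already_exists_exception'."""


class FakeIndices:
    """Hypothetical in-memory stand-in for an indices API (not the real client)."""

    def __init__(self):
        self._names = set()
        self._lock = threading.Lock()

    def exists(self, index):
        with self._lock:
            return index in self._names

    def create(self, index):
        # Like the server, accept the first creation and reject later ones.
        with self._lock:
            if index in self._names:
                raise IndexAlreadyExists(index)
            self._names.add(index)


def create_index_tolerant(indices, index):
    """Try to create the index; treat 'already exists' as success."""
    try:
        indices.create(index)
        return True    # this caller created the index
    except IndexAlreadyExists:
        return False   # another worker created it first; safe to continue


indices = FakeIndices()
results = []
results_lock = threading.Lock()


def worker():
    created = create_index_tolerant(indices, 'example_index')
    with results_lock:
        results.append(created)


# Four concurrent "workers", mirroring --workers 4.
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [False, False, False, True]
```

Exactly one worker creates the index and the other three carry on instead of failing, whichever worker wins the race. Whether swallowing the error is acceptable depends on all tasks using the same index settings, which is the case in the question because every RequestTask shares one index.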

Any help would be much appreciated :)

This is my console output when I run ManyRequests --workers 4:
DEBUG: Checking if RequestTask(iteration=0) is complete
GET http://localhost:9200/update_log/entry/f55cf781cd5b4ff6be1454bc7fc624f874dea7ee [status:404 request:0.082s]
DEBUG: Marker document not found.
DEBUG: Checking if RequestTask(iteration=1) is complete
GET http://localhost:9200/update_log/entry/91af5a96a3e588ae318e996fd64add17465352b3 [status:404 request:0.020s]
DEBUG: Marker document not found.
DEBUG: Checking if RequestTask(iteration=2) is complete
GET http://localhost:9200/update_log/entry/41bb5cbca30df86d0815ec090b4d2fb20f2700d2 [status:404 request:0.051s]
DEBUG: Marker document not found.
DEBUG: Checking if RequestTask(iteration=3) is complete
GET http://localhost:9200/update_log/entry/d2dbeeca292ec62688a993c3b147272af2ba6a92 [status:404 request:0.061s]
DEBUG: Marker document not found.
INFO: Informed scheduler that task   ManyRequests__99914b932b   has status   PENDING
INFO: Informed scheduler that task   RequestTask_3_8a58dae6a3   has status   PENDING
INFO: Informed scheduler that task   RequestTask_2_eee8bd7963   has status   PENDING
INFO: Informed scheduler that task   RequestTask_1_020ce0ec4d   has status   PENDING
INFO: Informed scheduler that task   RequestTask_0_630962ba24   has status   PENDING
INFO: Done scheduling tasks
INFO: Running Worker with 4 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 5
DEBUG: Asking scheduler for work...
INFO: [pid 3211] Worker Worker(salt=582258671, workers=4, host=jgc.local, username=jgc, pid=3210) running   RequestTask(iteration=3)
DEBUG: Pending tasks: 4
DEBUG: Asking scheduler for work...
INFO: [pid 3212] Worker Worker(salt=582258671, workers=4, host=jgc.local, username=jgc, pid=3210) running   RequestTask(iteration=2)
DEBUG: Pending tasks: 3
DEBUG: Asking scheduler for work...
INFO: [pid 3213] Worker Worker(salt=582258671, workers=4, host=jgc.local, username=jgc, pid=3210) running   RequestTask(iteration=0)
DEBUG: Pending tasks: 2
DEBUG: 4 running tasks, waiting for next task to finish
INFO: [pid 3214] Worker Worker(salt=582258671, workers=4, host=jgc.local, username=jgc, pid=3210) running   RequestTask(iteration=1)
DEBUG: 4 running tasks, waiting for next task to finish
PUT http://localhost:9200/example_index [status:400 request:0.514s]
PUT http://localhost:9200/example_index [status:400 request:0.517s]
PUT http://localhost:9200/example_index [status:400 request:0.520s]
ERROR: [pid 3213] Worker Worker(salt=582258671, workers=4, host=jgc.local, username=jgc, pid=3210) failed    RequestTask(iteration=0)
Traceback (most recent call last):
  File "/Users/jgc/dev/env/lib/python3.5/site-packages/luigi/worker.py", line 192, in run
    new_deps = self._run_get_new_deps()
  File "/Users/jgc/dev/env/lib/python3.5/site-packages/luigi/worker.py", line 130, in _run_get_new_deps
    task_gen = self.task.run()
  File "/Users/jgc/dev/env/lib/python3.5/site-packages/luigi/contrib/esindex.py", line 448, in run
    self.create_index()
  File "/Users/jgc/dev/env/lib/python3.5/site-packages/luigi/contrib/esindex.py", line 399, in create_index
    es.indices.create(index=self.index, body=self.settings)
  File "/Users/jgc/dev/env/lib/python3.5/site-packages/elasticsearch/client/utils.py", line 71, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/Users/jgc/dev/env/lib/python3.5/site-packages/elasticsearch/client/indices.py", line 107, in create
    params=params, body=body)
  File "/Users/jgc/dev/env/lib/python3.5/site-packages/elasticsearch/transport.py", line 318, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/Users/jgc/dev/env/lib/python3.5/site-packages/elasticsearch/connection/http_urllib3.py", line 127, in perform_request
    self._raise_error(response.status, raw_data)
  File "/Users/jgc/dev/env/lib/python3.5/site-packages/elasticsearch/connection/base.py", line 122, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.RequestError: TransportError(400, 'index_already_exists_exception', 'index [example_index/PpySzpJ-QiSLNupQrmdVjg] already exists')
ERROR: [pid 3212] Worker Worker(salt=582258671, workers=4, host=jgc.local, username=jgc, pid=3210) failed    RequestTask(iteration=2)
Traceback (most recent call last):
  File "/Users/jgc/dev/env/lib/python3.5/site-packages/luigi/worker.py", line 192, in run
    new_deps = self._run_get_new_deps()
  File "/Users/jgc/dev/env/lib/python3.5/site-packages/luigi/worker.py", line 130, in _run_get_new_deps
    task_gen = self.task.run()
  File "/Users/jgc/dev/env/lib/python3.5/site-packages/luigi/contrib/esindex.py", line 448, in run
    self.create_index()
  File "/Users/jgc/dev/env/lib/python3.5/site-packages/luigi/contrib/esindex.py", line 399, in create_index
    es.indices.create(index=self.index, body=self.settings)
  File "/Users/jgc/dev/env/lib/python3.5/site-packages/elasticsearch/client/utils.py", line 71, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/Users/jgc/dev/env/lib/python3.5/site-packages/elasticsearch/client/indices.py", line 107, in create
    params=params, body=body)
  File "/Users/jgc/dev/env/lib/python3.5/site-packages/elasticsearch/transport.py", line 318, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/Users/jgc/dev/env/lib/python3.5/site-packages/elasticsearch/connection/http_urllib3.py", line 127, in perform_request
    self._raise_error(response.status, raw_data)
  File "/Users/jgc/dev/env/lib/python3.5/site-packages/elasticsearch/connection/base.py", line 122, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.RequestError: TransportError(400, 'index_already_exists_exception', 'index [example_index/PpySzpJ-QiSLNupQrmdVjg] already exists')

ERROR: [pid 3214] Worker Worker(salt=582258671, workers=4, host=jgc.local, username=jgc, pid=3210) failed    RequestTask(iteration=1)
Traceback (most recent call last):
  File "/Users/jgc/dev/env/lib/python3.5/site-packages/luigi/worker.py", line 192, in run
    new_deps = self._run_get_new_deps()
  File "/Users/jgc/dev/upm/tfg/TFG-JorgeGarciaCastano/env/lib/python3.5/site-packages/luigi/worker.py", line 130, in _run_get_new_deps
    task_gen = self.task.run()
  File "/Users/jgc/dev/env/lib/python3.5/site-packages/luigi/contrib/esindex.py", line 448, in run
    self.create_index()
  File "/Users/jgc/dev/env/lib/python3.5/site-packages/luigi/contrib/esindex.py", line 399, in create_index
    es.indices.create(index=self.index, body=self.settings)
  File "/Users/jgc/dev/env/lib/python3.5/site-packages/elasticsearch/client/utils.py", line 71, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/Users/jgc/dev/env/lib/python3.5/site-packages/elasticsearch/client/indices.py", line 107, in create
    params=params, body=body)
  File "/Users/jgc/dev/env/lib/python3.5/site-packages/elasticsearch/transport.py", line 318, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/Users/jgc/dev/env/lib/python3.5/site-packages/elasticsearch/connection/http_urllib3.py", line 127, in perform_request
    self._raise_error(response.status, raw_data)
  File "/Users/jgc/dev/env/lib/python3.5/site-packages/elasticsearch/connection/base.py", line 122, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.RequestError: TransportError(400, 'index_already_exists_exception', 'index [example_index/PpySzpJ-QiSLNupQrmdVjg] already exists')
INFO: Informed scheduler that task   RequestTask_2_eee8bd7963   has status   FAILED
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
DEBUG: RequestTask_3_8a58dae6a3 is currently run by worker Worker(salt=582258671, workers=4, host=jgc.local, username=jgc, pid=3210)
DEBUG: RequestTask_1_020ce0ec4d is currently run by worker Worker(salt=582258671, workers=4, host=jgc.local, username=jgc, pid=3210)
DEBUG: RequestTask_0_630962ba24 is currently run by worker Worker(salt=582258671, workers=4, host=jgc.local, username=jgc, pid=3210)
INFO: Informed scheduler that task   RequestTask_1_020ce0ec4d   has status   FAILED
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
DEBUG: RequestTask_3_8a58dae6a3 is currently run by worker Worker(salt=582258671, workers=4, host=jgc.local, username=jgc, pid=3210)
DEBUG: RequestTask_0_630962ba24 is currently run by worker Worker(salt=582258671, workers=4, host=jgc.local, username=jgc, pid=3210)
INFO: Informed scheduler that task   RequestTask_0_630962ba24   has status   FAILED
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
DEBUG: RequestTask_3_8a58dae6a3 is currently run by worker Worker(salt=582258671, workers=4, host=jgc.local, username=jgc, pid=3210)
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
DEBUG: RequestTask_3_8a58dae6a3 is currently run by worker Worker(salt=582258671, workers=4, host=jgc.local, username=jgc, pid=3210)
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
DEBUG: RequestTask_3_8a58dae6a3 is currently run by worker Worker(salt=582258671, workers=4, host=jgc.local, username=jgc, pid=3210)
INFO: [pid 3211] Worker Worker(salt=582258671, workers=4, host=jgc.local, username=jgc, pid=3210) done      RequestTask(iteration=3)
INFO: Informed scheduler that task   RequestTask_3_8a58dae6a3   has status   DONE
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
DEBUG: There are 4 pending tasks possibly being run by other workers
DEBUG: There are 4 pending tasks unique to this worker
DEBUG: There are 4 pending tasks last scheduled by this worker
INFO: Worker Worker(salt=582258671, workers=4, host=jgc.local, username=jgc, pid=3210) was stopped. Shutting down Keep-Alive thread
INFO: 
===== Luigi Execution Summary =====

Scheduled 5 tasks of which:
* 1 ran successfully:
    - 1 RequestTask(iteration=3)
* 3 failed:
    - 3 RequestTask(iteration=0,1,2)
* 1 were left pending, among these:
    * 1 had failed dependencies:
        - 1 ManyRequests()

This progress looks :( because there were failed tasks

===== Luigi Execution Summary =====

0 answers:

No answers yet