I am building a web scraper that will crawl multiple domains from different threads. Since there are many different domains, I want to be able to search the logged information per thread.
UPDATE: solution implemented in the code below; follow the # SOLUTION comments.
The script is set up as follows:
import logging
from queue import Queue, Empty
from threading import current_thread # SOLUTION
from concurrent.futures import ThreadPoolExecutor
logging.basicConfig(
    format='%(threadName)s %(levelname)s: %(message)s',
    level=logging.INFO
)
class Scraper:
    def __init__(self, max_workers):
        self.pool = ThreadPoolExecutor(max_workers=max_workers, thread_name_prefix='T')
        self.to_crawl = Queue()
        for task in self.setup_tasks(tasks=max_workers):
            logging.info('Putting task to queue:\n{}'.format(task))
            self.to_crawl.put(task)
        logging.info('Queue size after init: {}'.format(self.to_crawl.qsize()))
    def setup_tasks(self, tasks=1):
        # Prepare tasks for the queue (placeholder body: one task dict per worker)
        return [{'id': 'task-{}'.format(i)} for i in range(tasks)]
    def run_task(self, task):
        # Function for executing the task
        current_thread().name = task['id']  # SOLUTION
        logging.info('Executing task:\n{}'.format(task))
        id = task['id']  # I want the task id to be reflected in the logging output whenever run_task runs
    def run_scraper(self):
        while True:
            logging.info('Launching new thread, queue size is {}'.format(self.to_crawl.qsize()))
            try:
                task = self.to_crawl.get_nowait()  # non-blocking, so Empty is actually raised once the queue drains
                self.pool.submit(self.run_task, task)
            except Empty:
                break
if __name__ == '__main__':
    s = Scraper(max_workers=3)
    s.run_scraper()
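For reference on why the # SOLUTION lines work: %(threadName)s is filled in from threading.current_thread().name at the moment each log record is created, so renaming the pooled thread inside the task makes all subsequent log lines from that thread carry the task id. A minimal standalone sketch (the work function and 'demo-task' id are illustrative):

import logging
from threading import current_thread
from concurrent.futures import ThreadPoolExecutor

logging.basicConfig(format='%(threadName)s %(levelname)s: %(message)s',
                    level=logging.INFO)

def work(task_id):
    # Rename the pooled thread; every later record from this thread
    # will show task_id in place of the pool's default name
    current_thread().name = task_id
    logging.info('working')  # prints: "demo-task INFO: working"

with ThreadPoolExecutor(max_workers=1) as pool:
    pool.submit(work, 'demo-task').result()

Note that pool threads are reused, so a thread keeps its last assigned name until the next task renames it.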
I want task['id'] to appear in the logging format configuration in place of the given %(threadName)s, without having to set it manually every time the script logs something in run_task.
Is it possible to assign task['id'] to the thread's %(threadName)s when the thread executes the task in run_scraper?
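If you would rather not rename the pool's threads at all, a custom field in the format string is another option: a logging.Filter can stamp every record with a task_id attribute, which the format string then references as %(task_id)s. Below is a minimal sketch under my own naming (_task_ctx, TaskIdFilter, and the 'no-task' fallback are assumptions, not part of the script above):

import logging
import threading

_task_ctx = threading.local()  # hypothetical thread-local holder for the current task id

class TaskIdFilter(logging.Filter):
    def filter(self, record):
        # Attach the current task id (or a fallback) so %(task_id)s resolves
        record.task_id = getattr(_task_ctx, 'id', 'no-task')
        return True

logging.basicConfig(format='%(task_id)s %(levelname)s: %(message)s', level=logging.INFO)
for handler in logging.getLogger().handlers:
    handler.addFilter(TaskIdFilter())

def run_task(task):
    _task_ctx.id = task['id']  # set once per task; all later logs pick it up
    logging.info('Executing task: {}'.format(task))

run_task({'id': 'task-42'})  # prints: "task-42 INFO: Executing task: {'id': 'task-42'}"

Since the value lives in a threading.local, each worker thread sees only the id of the task it is currently running, so reused pool threads cannot leak a stale id into another task's log lines.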