Recently I tried adding threading to my scraper so that it scrapes more efficiently.
But somehow it randomly causes "python.exe has stopped working" with no further information given, so I have no idea how to debug it.
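One thing I plan to try is the faulthandler module to get a traceback when the interpreter dies instead of just the "stopped working" dialog (it is in the standard library from Python 3.3 on; for Python 2, which this code targets, there is a backport on PyPI). A minimal sketch of how I would enable it:

import faulthandler

# dump the Python traceback of every thread on a fatal error
# (segfault, fatal interpreter error) instead of dying silently
faulthandler.enable(all_threads=True)

# optionally dump all thread tracebacks every 60 seconds,
# which also helps spot hangs and deadlocks
faulthandler.dump_traceback_later(timeout=60, repeat=True)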
Here is some of the relevant code.
Where the threads are started:
def run(self):
    """
    create the threads and run the scraper
    :return:
    """
    self.__load_resource()
    # each thread is allocated a different set of links to scrape,
    # so there should be no collisions between threads
    self.__prepare_threads_args()
    for item in self.threads_args:
        try:
            t = threading.Thread(target=self.urllib_method, args=(item,))
            # use the following line instead to use the selenium scraper
            # t = threading.Thread(target=self.__scrape_site, args=(item,))
            self.threads.append(t)
            t.start()
        except Exception as ex:
            print ex
What the scraper looks like:
def urllib_method(self, thread_args):
    """
    :param thread_args: arguments containing the files to scrape and the proxy to use
    :return:
    """
    site_scraper = SiteScraper()
    for file in thread_args["files"]:
        current_folder_path = self.__prepare_output_folder(file["name"])

        articles_without_comments_file = os.path.join(current_folder_path, "articles_without_comments")
        articles_without_comments_links = get_links_from_file(articles_without_comments_file) if isfile(articles_without_comments_file) else []

        articles_scraped_file = os.path.join(current_folder_path, "articles_scraped")
        # read the scraped-links file here (the original read the
        # without-comments file twice, which looks like a copy-paste bug)
        articles_scraped_links = get_links_from_file(articles_scraped_file) if isfile(articles_scraped_file) else []

        links = get_links_from_file(file["path"])
        for link in links:
            article_id = extract_article_id(link)
            # skip articles that have already been written to disk
            if isfile(join(current_folder_path, article_id)):
                print "skip: ", link
                if link not in articles_scraped_links:
                    append_text_to_file(articles_scraped_file, link)
                continue
            if link in articles_without_comments_links:
                continue
            comments = site_scraper.call_comments_endpoint(article_id, thread_args["proxy"])
            if comments is not None and comments not in ("Pro article", "Crash", "No Comments"):
                print article_id, comments[0:14]
                write_text_to_file(os.path.join(current_folder_path, article_id), comments)
                sleep(1)
                append_text_to_file(articles_scraped_file, link)
            elif comments == "No Comments":
                print "article without comments: ", article_id
                if link not in articles_without_comments_links:
                    append_text_to_file(articles_without_comments_file, link)
                sleep(1)
I have tried running the script on both Windows 10 and Windows 8.1, and the problem exists on both.
Also, the more data it scrapes, the more frequently the crash occurs, and the more threads I use, the more frequently it occurs as well.
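In case the thread count itself is the problem, this is roughly how I could cap it with a fixed-size pool of worker threads (using multiprocessing.dummy, which I have not tried yet; the pool size of 4 is just an example):

from multiprocessing.dummy import Pool  # thread-based version of the Pool API

def run(self):
    self.__load_resource()
    self.__prepare_threads_args()
    pool = Pool(4)  # at most 4 scraper threads running at a time
    pool.map(self.urllib_method, self.threads_args)
    pool.close()
    pool.join()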
Answer 0 (score: 0):
Because of the infamous Global Interpreter Lock, threads in CPython (2.x and 3.x alike) never execute Python bytecode in parallel, so threading gives you concurrency but not true parallelism.
The preferred way to use multiple cores and processes in Python is the multiprocessing package.
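A minimal sketch of what that could look like for this scraper (scrape_one and the placeholder arguments are hypothetical; multiprocessing must be able to pickle both the worker target and its arguments, which is why the worker is a module-level function rather than a bound method):

from multiprocessing import Pool

def scrape_one(thread_args):
    # the body of urllib_method would go here; placeholder for the sketch:
    print "scraping files:", thread_args["files"], "via proxy:", thread_args["proxy"]

if __name__ == "__main__":
    # the __main__ guard is mandatory on Windows, where child
    # processes start by re-importing this module
    threads_args = [{"proxy": None, "files": []}]  # placeholder work items
    pool = Pool(processes=4)  # e.g. one worker process per core
    pool.map(scrape_one, threads_args)
    pool.close()
    pool.join()

Each worker is a separate interpreter with its own GIL, so the crash behavior of one worker cannot take down the others, and a dead worker is much easier to spot than a silently dying python.exe.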