Threading causes "python.exe has stopped working"

Posted: 2015-08-16 21:06:55

Tags: python multithreading python-2.7 web-scraping thread-safety

Recently I tried to add threading to my scraper to make the scraping more efficient.

Somehow, though, it randomly causes python.exe to "stop working" with no further information, so I have no idea how to debug it.

Here is the relevant code:

  1. Where the threads are started (a sketch of waiting for these threads follows this list):

    import threading  # needed at module level by this method

    def run(self):
        """
        create the threads and run the scraper
        :return:
        """
        self.__load_resource()
        # each thread is allocated a different set of links to scrape,
        # so there should be no collisions between threads
        self.__prepare_threads_args()
        for item in self.threads_args:
            try:
                t = threading.Thread(target=self.urllib_method, args=(item,))
                # use the following line to use the selenium scraper instead
                # t = threading.Thread(target=self.__scrape_site, args=(item,))

                self.threads.append(t)
                t.start()
            except Exception as ex:
                print ex
    
  2. What the scraper looks like:

    import os
    from os.path import isfile, join
    from time import sleep

    def urllib_method(self, thread_args):
        """
        :param thread_args: arguments containing the files to scrape and the proxy to use
        :return:
        """
        site_scraper = SiteScraper()
        for file in thread_args["files"]:
            current_folder_path = self.__prepare_output_folder(file["name"])

            articles_without_comments_file = os.path.join(current_folder_path, "articles_without_comments")
            articles_without_comments_links = get_links_from_file(articles_without_comments_file) if isfile(articles_without_comments_file) else []

            articles_scraped_file = os.path.join(current_folder_path, "articles_scraped")
            articles_scraped_links = get_links_from_file(articles_scraped_file) if isfile(articles_scraped_file) else []

            links = get_links_from_file(file["path"])
            for link in links:
                article_id = extract_article_id(link)

                # already scraped: record the link and move on
                if isfile(join(current_folder_path, article_id)):
                    print "skip: ", link
                    if link not in articles_scraped_links:
                        append_text_to_file(articles_scraped_file, link)
                    continue
                if link in articles_without_comments_links:
                    continue

                comments = site_scraper.call_comments_endpoint(article_id, thread_args["proxy"])

                if comments is not None and comments not in ("Pro article", "Crash", "No Comments"):
                    print article_id, comments[0:14]
                    write_text_to_file(os.path.join(current_folder_path, article_id), comments)
                    sleep(1)
                    append_text_to_file(articles_scraped_file, link)
                elif comments == "No Comments":
                    print "article without comments: ", article_id
                    if link not in articles_without_comments_links:
                        append_text_to_file(articles_without_comments_file, link)
                    sleep(1)
    
  3. I have tried running the script on both Windows 10 and 8.1; the problem occurs on both.

    Also, the more data it scrapes, the more often the crash happens, and the more threads I use, the more often it happens.
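
One more detail about the run() snippet in item 1: the threads are collected in self.threads but never joined, so run() returns while the scrapers are still working. A minimal sketch of waiting for them, assuming self.threads holds the started Thread objects as in the code above (wait_for_threads is a hypothetical helper, not part of the original scraper):

    def wait_for_threads(self):
        """Block until every scraper thread has finished."""
        for t in self.threads:
            t.join()  # join() returns once the thread's target has completed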

1 answer:

Answer 0: (score: 0)

Threads in Python are quite unsafe to use because of the infamous Global Interpreter Lock (GIL): only one thread can execute Python bytecode at a time.

The preferred way to use multiple cores in Python is through multiple processes, via the multiprocessing package.

https://docs.python.org/2/library/multiprocessing.html
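
As a rough sketch of that suggestion, the question's existing partitioning of work into thread_args items maps directly onto a process pool. This is an illustration under assumptions, not the asker's code: scrape_partition is a hypothetical module-level worker standing in for urllib_method (in Python 2, Pool can only pickle top-level functions, not bound methods), and prepare_threads_args stands for whatever builds self.threads_args today.

    from multiprocessing import Pool

    def scrape_partition(thread_args):
        """Do the same per-partition work as urllib_method, in its own process."""
        site_scraper = SiteScraper()
        for file in thread_args["files"]:
            pass  # ... same per-file scraping logic as in urllib_method above ...

    if __name__ == "__main__":
        args_list = prepare_threads_args()     # one entry per worker, as before
        pool = Pool(processes=len(args_list))
        pool.map(scrape_partition, args_list)  # blocks until every partition is done
        pool.close()
        pool.join()

A crash inside one worker then takes down only that process rather than the whole interpreter, which also makes this kind of failure easier to isolate.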