Recently I tried adding threading to my scraper so that it scrapes more efficiently.
But somehow it randomly causes "python.exe has stopped working" with no further information given, so I have no idea how to debug it.
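One thing I plan to try is the faulthandler module to get a traceback when the interpreter dies instead of just the "stopped working" dialog (it is in the standard library from Python 3.3 on; for Python 2, which this code targets, there is a backport on PyPI). A minimal sketch of how I would enable it:

import faulthandler

# dump the Python traceback of every thread on a fatal error
# (segfault, fatal interpreter error) instead of dying silently
faulthandler.enable(all_threads=True)

# optionally dump all thread tracebacks every 60 seconds,
# which also helps spot hangs and deadlocks
faulthandler.dump_traceback_later(timeout=60, repeat=True)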
Here is some of the relevant code.
Where the threads are started:
def run(self):
    """
    create the threads and run the scraper
    :return:
    """
    self.__load_resource()
    # each thread is allocated a different set of links to scrape,
    # so there should be no collisions between threads
    self.__prepare_threads_args()
    for item in self.threads_args:
        try:
            t = threading.Thread(target=self.urllib_method, args=(item,))
            # use the following line instead to use the selenium scraper
            # t = threading.Thread(target=self.__scrape_site, args=(item,))
            self.threads.append(t)
            t.start()
        except Exception as ex:
            print ex
What the scraper looks like:
def urllib_method(self, thread_args):
    """
    :param thread_args: arguments containing the files to scrape and the proxy to use
    :return:
    """
    site_scraper = SiteScraper()
    for file in thread_args["files"]:
        current_folder_path = self.__prepare_output_folder(file["name"])

        articles_without_comments_file = os.path.join(current_folder_path, "articles_without_comments")
        articles_without_comments_links = get_links_from_file(articles_without_comments_file) if isfile(articles_without_comments_file) else []

        articles_scraped_file = os.path.join(current_folder_path, "articles_scraped")
        # read the scraped-links file here (the original read the
        # without-comments file twice, which looks like a copy-paste bug)
        articles_scraped_links = get_links_from_file(articles_scraped_file) if isfile(articles_scraped_file) else []

        links = get_links_from_file(file["path"])
        for link in links:
            article_id = extract_article_id(link)
            # skip articles that have already been written to disk
            if isfile(join(current_folder_path, article_id)):
                print "skip: ", link
                if link not in articles_scraped_links:
                    append_text_to_file(articles_scraped_file, link)
                continue
            if link in articles_without_comments_links:
                continue
            comments = site_scraper.call_comments_endpoint(article_id, thread_args["proxy"])
            if comments is not None and comments not in ("Pro article", "Crash", "No Comments"):
                print article_id, comments[0:14]
                write_text_to_file(os.path.join(current_folder_path, article_id), comments)
                sleep(1)
                append_text_to_file(articles_scraped_file, link)
            elif comments == "No Comments":
                print "article without comments: ", article_id
                if link not in articles_without_comments_links:
                    append_text_to_file(articles_without_comments_file, link)
                sleep(1)
I have tried running the script on both Windows 10 and Windows 8.1, and the problem exists on both.
Also, the more data it scrapes, the more frequently the crash occurs, and the more threads I use, the more frequently it occurs as well.
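In case the thread count itself is the problem, this is roughly how I could cap it with a fixed-size pool of worker threads (using multiprocessing.dummy, which I have not tried yet; the pool size of 4 is just an example):

from multiprocessing.dummy import Pool  # thread-based version of the Pool API

def run(self):
    self.__load_resource()
    self.__prepare_threads_args()
    pool = Pool(4)  # at most 4 scraper threads running at a time
    pool.map(self.urllib_method, self.threads_args)
    pool.close()
    pool.join()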
Answer 0 (score: 0):
Because of the infamous Global Interpreter Lock, threads in CPython (2.x and 3.x alike) never execute Python bytecode in parallel, so threading gives you concurrency but not true parallelism.
The preferred way to use multiple cores and processes in Python is the multiprocessing package.
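A minimal sketch of what that could look like for this scraper (scrape_one and the placeholder arguments are hypothetical; multiprocessing must be able to pickle both the worker target and its arguments, which is why the worker is a module-level function rather than a bound method):

from multiprocessing import Pool

def scrape_one(thread_args):
    # the body of urllib_method would go here; placeholder for the sketch:
    print "scraping files:", thread_args["files"], "via proxy:", thread_args["proxy"]

if __name__ == "__main__":
    # the __main__ guard is mandatory on Windows, where child
    # processes start by re-importing this module
    threads_args = [{"proxy": None, "files": []}]  # placeholder work items
    pool = Pool(processes=4)  # e.g. one worker process per core
    pool.map(scrape_one, threads_args)
    pool.close()
    pool.join()

Each worker is a separate interpreter with its own GIL, so the crash behavior of one worker cannot take down the others, and a dead worker is much easier to spot than a silently dying python.exe.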