Multithreaded web scraping in Python 3

Time: 2016-05-25 02:19:24

Tags: python multithreading python-3.x web-scraping

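I have been using a simple web scraping program to learn how to code, and it works, but I would like to see how to make it faster. How can I add multithreading to this program? All it does is open a file of stock symbols and look up the price of each stock online.

Here is my code: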

import urllib.request

# read one ticker symbol per line, skipping blank lines
with open("Stocklist.txt") as symbolsfile:
    thesymbolslist = [line.strip() for line in symbolsfile if line.strip()]

for symbol in thesymbolslist:
    theurl = "http://www.google.com/finance/getprices?q=" + symbol + "&i=10&p=25m&f=c"
    with urllib.request.urlopen(theurl) as response:
        # read the character encoding from the `Content-Type` response header,
        # falling back to UTF-8 when the server does not declare one
        charset_encoding = response.info().get_content_charset() or "utf-8"
        # apply encoding
        thepage = response.read().decode(charset_encoding)
    # the price is the last whitespace-separated token on the page
    print(symbol + " price is " + thepage.split()[-1])

2 Answers:

Answer 0 (score: 0)

If you are just applying one function to every item of a list, I suggest multiprocessing.Pool.map(function, list).

https://docs.python.org/3/library/multiprocessing.html?highlight=multiprocessing%20map#multiprocessing.pool.Pool.map
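As a rough illustration of the map-over-a-list idea, here is a minimal sketch applied to the question's loop. It swaps in multiprocessing.pool.ThreadPool, which offers the same map interface as multiprocessing.Pool but runs the calls in threads, since the work is mostly waiting on the network. The helper names (build_url, last_price, fetch_price, prices_for) and the worker count are assumptions made for this example, and the URL pattern is copied from the question without checking that the endpoint still responds.

```python
import urllib.request
from multiprocessing.pool import ThreadPool


def build_url(symbol):
    # Same URL pattern as in the question (not verified to still work).
    return "http://www.google.com/finance/getprices?q=" + symbol + "&i=10&p=25m&f=c"


def last_price(page_text):
    # The question treats the last whitespace-separated token as the price.
    return page_text.split()[-1]


def fetch_price(symbol):
    # Blocking download of one symbol's page, then extract the last token.
    with urllib.request.urlopen(build_url(symbol)) as page:
        charset = page.info().get_content_charset() or "utf-8"
        return last_price(page.read().decode(charset))


def prices_for(symbols, fetch=fetch_price, workers=8):
    # pool.map calls fetch(symbol) from several threads at once and
    # returns the results in the same order as the input list.
    with ThreadPool(workers) as pool:
        return dict(zip(symbols, pool.map(fetch, symbols)))
```

Calling prices_for(thesymbolslist) would then return a dict mapping each symbol to its last token. The fetch parameter exists only so that the download step can be replaced, for example with a stub while testing offline.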

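Answer 1 (score: 0)

You need to use asyncio. It is a very neat package that can also help you with scraping. I wrote a small snippet showing how to integrate with linkedin using asyncio, but you can adapt it to your needs quite easily.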

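import asyncio
import requests  # third-party package: pip install requests

def scrape_first_site():
    url = 'http://example.com/'
    # blocking HTTP request; return the body so the caller can use it
    return requests.get(url).text


def scrape_another_site():
    url = 'http://example.com/other/'
    return requests.get(url).text

# create the loop explicitly; recent Python versions no longer create one implicitly
loop = asyncio.new_event_loop()

# passing None as the executor selects the loop's default executor
tasks = [
    loop.run_in_executor(None, scrape_first_site),
    loop.run_in_executor(None, scrape_another_site)
]

# block until both downloads have finished
loop.run_until_complete(asyncio.wait(tasks))
loop.close()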

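Since the default executor is a ThreadPoolExecutor, each task runs in a separate thread. If you would rather run the tasks in processes than in threads (for example, because of GIL-related issues), you can use a ProcessPoolExecutor instead.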
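To show how the same idea might extend from two fixed functions to a whole list of symbols, here is a minimal sketch that passes an explicit ThreadPoolExecutor to run_in_executor and collects the results with asyncio.gather. The names fetch_all and fake_fetch are hypothetical; fake_fetch is an offline stand-in for a real blocking download, so the example runs without network access.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def fetch_all(symbols, fetch, max_workers=8):
    # Run the blocking fetch(symbol) calls in worker threads and wait for all.
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [loop.run_in_executor(executor, fetch, s) for s in symbols]
        # gather returns the results in the same order as the futures.
        results = await asyncio.gather(*futures)
    return dict(zip(symbols, results))


def fake_fetch(symbol):
    # Hypothetical stand-in for a blocking download of one symbol's price.
    return symbol + " ok"


prices = asyncio.run(fetch_all(["AAPL", "GOOG", "MSFT"], fake_fetch))
print(prices)
```

Replacing ThreadPoolExecutor with ProcessPoolExecutor should work the same way, provided the fetch function is defined at module level so that it can be pickled and sent to the worker processes.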