Urllib2 & BeautifulSoup: nice couple but too slow - urllib3 & threads?

Date: 2012-04-22 03:59:31

Tags: python multithreading performance beautifulsoup urllib2

I was looking for a way to optimize my code when I heard some good things about threads and urllib3. Apparently, people disagree on which solution is the best.

The problem with my script below is the execution time: it is so slow!

Step 1: I fetch this page http://www.cambridgeesol.org/institutions/results.php?region=Afghanistan&type=&BULATS=on

Step 2: I parse the page with BeautifulSoup

Step 3: I put the data into an Excel document

Step 4: I do it again and again for all the countries in my list (a big list); I just change "Afghanistan" in the url to another country.

Here is my code:

import urllib
import xlwt  # assumed: the wb/ws objects below match xlwt's Workbook API
from BeautifulSoup import BeautifulSoup as soup  # BS3 import; with bs4 use "from bs4 import BeautifulSoup as soup"

wb = xlwt.Workbook()       # created earlier in the full script
name_excel = "BULATS_IA"   # placeholder; set earlier in the full script

ws = wb.add_sheet("BULATS_IA") # We add a new tab in the excel doc
x = 0 # We need x and y for pulling the data into the excel doc
y = 0
Countries_List = ['Afghanistan','Albania','Andorra','Argentina','Armenia','Australia','Austria','Azerbaijan','Bahrain','Bangladesh','Belgium','Belize','Bolivia','Bosnia and Herzegovina','Brazil','Brunei Darussalam','Bulgaria','Cameroon','Canada','Central African Republic','Chile','China','Colombia','Costa Rica','Croatia','Cuba','Cyprus','Czech Republic','Denmark','Dominican Republic','Ecuador','Egypt','Eritrea','Estonia','Ethiopia','Faroe Islands','Fiji','Finland','France','French Polynesia','Georgia','Germany','Gibraltar','Greece','Grenada','Hong Kong','Hungary','Iceland','India','Indonesia','Iran','Iraq','Ireland','Israel','Italy','Jamaica','Japan','Jordan','Kazakhstan','Kenya','Kuwait','Latvia','Lebanon','Libya','Liechtenstein','Lithuania','Luxembourg','Macau','Macedonia','Malaysia','Maldives','Malta','Mexico','Monaco','Montenegro','Morocco','Mozambique','Myanmar (Burma)','Nepal','Netherlands','New Caledonia','New Zealand','Nigeria','Norway','Oman','Pakistan','Palestine','Papua New Guinea','Paraguay','Peru','Philippines','Poland','Portugal','Qatar','Romania','Russia','Saudi Arabia','Serbia','Singapore','Slovakia','Slovenia','South Africa','South Korea','Spain','Sri Lanka','Sweden','Switzerland','Syria','Taiwan','Thailand','Trinadad and Tobago','Tunisia','Turkey','Ukraine','United Arab Emirates','United Kingdom','United States','Uruguay','Uzbekistan','Venezuela','Vietnam']
Longueur = len(Countries_List)

for Countries in Countries_List:
    y = 0

    # Open the page with the name of the corresponding country in the url
    htmlSource = urllib.urlopen("http://www.cambridgeesol.org/institutions/results.php?region=%s&type=&BULATS=on" % (Countries)).read()
    s = soup(htmlSource)
    tableGood = s.findAll('table')
    try:
        rows = tableGood[3].findAll('tr')
        for tr in rows:
            cols = tr.findAll('td')
            y = 0
            x = x + 1
            for td in cols:
                hum = td.text
                ws.write(x, y, hum)
                y = y + 1
                wb.save("%s.xls" % name_excel) # note: saving here rewrites the whole file once per cell

    except IndexError:
        pass

So I know not everything is perfect, but I am looking forward to learning new things in Python! The script is very slow because urllib2 is not that fast, and neither is BeautifulSoup. For the soup part I guess I can't do much better (though maybe SoupStrainer, sketched below, could help), but for urllib2 I'm not so sure.
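One thing that might still help on the soup side is bs4's SoupStrainer, which builds soup objects only for the tags you ask for instead of the whole document. A minimal sketch, assuming bs4 and the same result page:

import urllib
from bs4 import BeautifulSoup, SoupStrainer

html = urllib.urlopen("http://www.cambridgeesol.org/institutions/results.php?region=Afghanistan&type=&BULATS=on").read()

# Parse only the <table> tags instead of the whole document
s = BeautifulSoup(html, parse_only=SoupStrainer('table'))
tableGood = s.find_all('table')

This only touches parsing, though; the dominant cost is still the 122 sequential HTTP round trips.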

Edit 1: Multiprocessing useless with urllib2? seems interesting in my case. What do you guys think about this potential solution?

# Make sure that the queue is thread-safe!!

def producer(self):
    # Only need one producer, although you could have multiple
    with open('urllist.txt', 'r') as fh:
        for line in fh:
            self.queue.enqueue(line.strip())

def consumer(self):
    # Fire up N of these babies for some speed
    while True:
        url = self.queue.dequeue()
        dh = urllib2.urlopen(url)
        with open('/dev/null', 'w') as fh: # gotta put it somewhere
            fh.write(dh.read())

Edit 2: URLLIB3. Can anyone tell me more about this?

Re-use the same socket connection for multiple requests (HTTPConnectionPool and HTTPSConnectionPool) (with optional client-side certificate verification). https://github.com/shazow/urllib3

As I am requesting the same website 122 times for different pages, I guess reusing the same socket connection could be interesting, am I wrong? Couldn't it be faster? ...

http = urllib3.PoolManager()
r = http.request('GET', 'http://www.bulats.org')
for Pages in Pages_List:
    r = http.request('GET', 'http://www.bulats.org/agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=%s' % (Pages))
    s = soup(r.data)

3 Answers:

Answer 0 (score: 9):

Consider using something like workerpool. Referring to the Mass Downloader example, combining it with urllib3 would look something like:

import workerpool
import urllib3

URL_LIST = [] # Fill this from somewhere

NUM_SOCKETS = 3
NUM_WORKERS = 5

# We want a few more workers than sockets so that they have extra
# time to parse things and such.

http = urllib3.PoolManager(maxsize=NUM_SOCKETS)
workers = workerpool.WorkerPool(size=NUM_WORKERS)

class MyJob(workerpool.Job):
    def __init__(self, url):
        self.url = url

    def run(self):
        r = http.request('GET', self.url)
        # ... do parsing stuff here


for url in URL_LIST:
    workers.put(MyJob(url))

# Send shutdown jobs to all threads, and wait until all the jobs have been completed
# (If you don't do this, the script might hang due to a rogue undead thread.)
workers.shutdown()
workers.wait()

As you can see from the Mass Downloader examples, there are multiple ways of doing this. I chose this particular example only because it is less magical, but any of the other strategies are valid as well.
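For instance, a rough equivalent of the same idea using only the standard library's multiprocessing.dummy thread pool (a sketch of mine, not from the workerpool docs) could look like:

import urllib3
from multiprocessing.dummy import Pool  # a thread pool behind the multiprocessing API

URL_LIST = []  # Fill this from somewhere

http = urllib3.PoolManager(maxsize=3)

def fetch(url):
    r = http.request('GET', url)
    # ... do parsing stuff here
    return url, len(r.data)

pool = Pool(5)                       # 5 worker threads sharing the pooled sockets
results = pool.map(fetch, URL_LIST)  # blocks until every URL is fetched
pool.close()
pool.join()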

Disclaimer: I am the author of urllib3 and workerpool.

Answer 1 (score: 2):

I don't think urllib or BeautifulSoup is slow. I ran your code on my local machine with a modified version (the excel stuff removed). It took around 100ms to open the connection, download the content, parse it, and print it to the console for one country.

10ms is the total time BeautifulSoup spent parsing the content and printing it to the console per country. That is fast.

Neither do I believe that using Scrapy or threading will solve the problem, because the problem is the expectation that it should be fast.

Welcome to the world of HTTP. It will be slow sometimes, sometimes it will be very fast. A couple of reasons for slow connections:

  • the server handling your request (which sometimes returns a 404),
  • DNS resolution,
  • the HTTP handshake,
  • your ISP's connection stability,
  • your bandwidth rate,
  • the packet loss rate,

and so on.

Don't forget that you are trying to make 121 HTTP requests to a server you know nothing about; they may also ban your IP address because of the subsequent calls.

Take a look at the Requests lib. Read their documentation. If you are doing this to learn more Python, don't jump straight into Scrapy.
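For example, a Session in Requests keeps the connection to a host alive between calls, which also covers the socket-reuse idea from the question. A minimal sketch, assuming a reasonably recent version of Requests:

import requests

session = requests.Session()  # reuses the TCP connection for requests to the same host

for country in ['Afghanistan', 'Albania']:  # shortened list, same URL scheme as the question
    resp = session.get(
        "http://www.cambridgeesol.org/institutions/results.php",
        params={'region': country, 'type': '', 'BULATS': 'on'},
    )
    html = resp.text  # feed this to BeautifulSoup as before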

Answer 2 (score: 0):

Hey guys,

Some news about the problem! I found this script, which might be useful! I am actually testing it and it is promising (6.03 seconds to run the script below).

My idea is to find a way to mix it with urllib3. Indeed, I am making requests to the same host many, many times.

The PoolManager will take care of reusing connections for you whenever you request the same host. This should cover most scenarios without significant loss of efficiency, but you can always drop down to a lower level component for more granular control. (urllib3 doc site)

Anyway, it seems very interesting, and even if I don't yet know how to mix the two functionalities (urllib3 and the threaded script below), I guess it is doable! :-) (One possible way is sketched after the script.)

Thanks a lot for taking the time to lend me a hand, it is much appreciated!

import Queue
import threading
import urllib2
import time
from bs4 import BeautifulSoup



hosts = ["http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All", "http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=1", "http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=2", "http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=3", "http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=4", "http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=5", "http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=6"]

queue = Queue.Queue()
out_queue = Queue.Queue()

class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            #grabs host from queue
            host = self.queue.get()

            #grabs urls of hosts and then grabs chunk of webpage
            url = urllib2.urlopen(host)
            chunk = url.read()

            #place chunk into out queue
            self.out_queue.put(chunk)

            #signals to queue job is done
            self.queue.task_done()

class DatamineThread(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue

    def run(self):
        while True:
            #grabs a chunk of html from the out queue
            chunk = self.out_queue.get()

            #parse the chunk
            soup = BeautifulSoup(chunk)
            #print soup.findAll(['table'])

            tableau = soup.find('table')
            rows = tableau.findAll('tr')
            for tr in rows:
                cols = tr.findAll('td')
                for td in cols:
                    texte_bu = td.text
                    texte_bu = texte_bu.encode('utf-8')
                    print texte_bu

            #signals to queue job is done
            self.out_queue.task_done()

start = time.time()
def main():

    #spawn a pool of threads, and pass them queue instance
    for i in range(5):
        t = ThreadUrl(queue, out_queue)
        t.setDaemon(True)
        t.start()

    #populate queue with data
    for host in hosts:
        queue.put(host)

    for i in range(5):
        dt = DatamineThread(out_queue)
        dt.setDaemon(True)
        dt.start()


    #wait on the queue until everything has been processed
    queue.join()
    out_queue.join()

main()
print "Elapsed Time: %s" % (time.time() - start)