Question

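As the title suggests, I'm working on a site written in Python that makes several calls to the urllib2 module to read websites. I then parse them with BeautifulSoup.

Since I have to read 5-10 sites, the page takes a while to load.

I'm just wondering whether there is a way to read the sites all at once. Or are there any tricks to make it faster, e.g. should I close the urllib2.urlopen after each read, or keep it open?

Added: also, if I were to just switch over to PHP, would that be faster for fetching and parsing HTML and XML files from other sites? I just want it to load faster, as opposed to the ~20 seconds it currently takes.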

Answer 1

I'm rewriting Dumb Guy's code below using modern Python modules like threading and Queue.

import threading, urllib2
import Queue

urls_to_load = [
'http://stackoverflow.com/',
'http://slashdot.org/',
'http://www.archive.org/',
'http://www.yahoo.co.jp/',
]

def read_url(url, queue):
    data = urllib2.urlopen(url).read()
    print('Fetched %s from %s' % (len(data), url))
    queue.put(data)

def fetch_parallel():
    result = Queue.Queue()
    threads = [threading.Thread(target=read_url, args = (url,result)) for url in urls_to_load]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

def fetch_sequencial():
    result = Queue.Queue()
    for url in urls_to_load:
        read_url(url,result)
    return result

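The best time for fetch_sequencial() is 2 s. The best time for fetch_parallel() is 0.9 s.

It is also incorrect to say that threads are useless in Python because of the GIL. This is one of those cases where threads are useful in Python, because the threads are blocked on I/O. As you can see in my results, the parallel case is about 2 times faster.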
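
For readers on Python 3, where urllib2 and Queue no longer exist under those names, the same idea could be sketched with concurrent.futures. This is not the code that produced the timings above. `fetch_all` and `slow_fetch` are hypothetical names, and the fetch function is passed in as a parameter so that the sketch can be tried with a stand-in that merely sleeps instead of a real download.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def read_url(url):
    """Download one URL and return its body as bytes (blocking call)."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def fetch_all(urls, fetch=read_url, max_workers=8):
    """Run fetch(url) for every URL in a thread pool.

    Results come back in the same order as `urls`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def slow_fetch(url):
    """Stand-in for a download: blocks like network I/O, without the network."""
    time.sleep(0.2)
    return url.upper()


if __name__ == "__main__":
    urls = ["a", "b", "c", "d"]
    started = time.perf_counter()
    pages = fetch_all(urls, fetch=slow_fetch)
    elapsed = time.perf_counter() - started
    # The four waits overlap, so the total should be close to one wait, not four.
    print(pages, "in %.2f s" % elapsed)
```

Passing `fetch=read_url` (the default) with real URLs would perform actual downloads; the thread count of 8 is an arbitrary starting point, not a recommendation.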

Answer 2

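Edit: Please take a look at Wai's post for a better version of this code. Note that there is nothing wrong with this code and it will work properly, despite the comments below.

The speed of reading web pages is probably bounded by your Internet connection, not by Python.

You could use threads to load them all at once.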

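import thread, time, urllib

# urls_to_load is assumed to be a list of URL strings defined elsewhere
websites = {}
def read_url(url):
  websites[url] = urllib.urlopen(url).read()

for url in urls_to_load: thread.start_new_thread(read_url, (url,))
# Wait until every URL has an entry (dict key order is not the list order)
while len(websites) < len(urls_to_load): time.sleep(0.1)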

# Now websites will contain the contents of all the web pages in urls_to_load

Answer 3

This might not be perfect, but when I need the data from a site, I just do this:

import socket
def geturldata(url):
    # Pass the URL WITHOUT the "http://" prefix, e.g. "example.com/page"
    server = url.split("/")[0]
    args = url[len(server):] or "/"  # path part; "/" if the URL has none
    returndata = str()
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((server, 80)) #lets connect :p

    s.sendall("GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (args, server)) #simple http request
    while 1:
        data = s.recv(1024) #buffer
        if not data: break
        returndata = returndata + data
    s.close()
    # Headers end at the first blank line; return only the body
    return returndata.split("\r\n\r\n", 1)[1]

Answer 4

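As a general rule, a given construct in any language is not slow until it has been measured.

In Python, not only do timings often run counter to intuition, but the tools for measuring execution time are exceptionally good.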
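
As one illustration of those tools, the standard library's timeit module can time a call directly. A minimal Python 3 sketch; `simulated_download` and `time_call` are made-up stand-ins for whatever code is actually suspected of being slow.

```python
import time
import timeit


def simulated_download():
    """Hypothetical stand-in for a slow call such as a page fetch."""
    time.sleep(0.05)


def time_call(func, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` single runs."""
    return min(timeit.repeat(func, number=1, repeat=repeat))


if __name__ == "__main__":
    best = time_call(simulated_download)
    print("best of 3: %.3f s" % best)
```

Taking the minimum over a few runs reduces noise from other activity on the machine; timing the fetch and the parsing separately would show which of the two dominates.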

Answer 5

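Scrapy might be useful for you. If you don't need all of its functionality, you might just use twisted's twisted.web.client.getPage instead. Asynchronous IO in one thread is going to be far more performant and easier to debug than anything that uses multiple threads and blocking IO.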
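
The single-threaded asynchronous style can be sketched without Twisted by using the standard library's asyncio (Python 3.7+), which is a substitute chosen here for illustration and not the getPage API mentioned above. `fake_fetch` is a hypothetical coroutine that only sleeps; a real program would await an asynchronous HTTP client in its place.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.2):
    """Hypothetical stand-in for an async HTTP request: waits, then returns text."""
    await asyncio.sleep(delay)
    return "contents of " + url


async def fetch_all(urls):
    """Start every fetch at once on one thread and wait for all of them."""
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


if __name__ == "__main__":
    urls = ["http://stackoverflow.com/", "http://slashdot.org/"]
    started = time.perf_counter()
    pages = asyncio.run(fetch_all(urls))
    print(len(pages), "pages in %.2f s" % (time.perf_counter() - started))
```

All the waits overlap on a single thread, and gather returns the results in the order the URLs were given.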

Answer 6

Not sure why nobody has mentioned multiprocessing (if anyone knows why this might be a bad idea, let me know):

import multiprocessing
from urllib2 import urlopen

URLS = [....]

def get_content(url):
    return urlopen(url).read()


pool = multiprocessing.Pool(processes=8)  # play with ``processes`` for best results
results = pool.map(get_content, URLS) # This line blocks, look at map_async 
                                      # for non-blocking map() call
pool.close()  # the process pool no longer accepts new tasks
pool.join()   # join the processes: this blocks until all URLs are processed
for result in results:
   pass  # do something

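There are a few caveats with multiprocessing pools. First, unlike threads, these are completely new Python processes (interpreters). While that means they are not subject to the global interpreter lock, it also means you are limited in what you can pass across to the new process.

You cannot pass lambdas or functions that are defined dynamically. The function used in the map() call must be defined in your module in a way that allows the other process to import it.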
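
The reason is that functions and arguments are sent to the worker processes with pickle, which stores a function as a reference to its importable name rather than as code. A small Python 3 sketch of that check, with `can_pickle` and `make_local` as hypothetical helper names:

```python
import pickle
from urllib.request import urlopen


def can_pickle(obj):
    """Return True if pickle can serialize obj, as a process pool would need."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def make_local():
    """Build a function dynamically; it has no importable module-level name."""
    def local_func(x):
        return x
    return local_func


if __name__ == "__main__":
    print(can_pickle(urlopen))         # module-level function
    print(can_pickle(lambda x: x))     # lambda
    print(can_pickle(make_local()))    # function defined inside another function
```

Running a candidate function through a check like this before handing it to a pool is a quick way to find out whether the pool will accept it.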

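Pool.map(), which is the most straightforward way to process multiple tasks concurrently, doesn't provide a way to pass multiple arguments, so you may need to write wrapper functions or change function signatures, and/or pass the multiple arguments as part of the iterable that is being mapped.

You cannot have child processes spawn new ones. Only the parent can spawn child processes. This means you have to plan and benchmark carefully (and sometimes write multiple versions of your code) to determine the most effective use of processes.

Drawbacks notwithstanding, I find multiprocessing to be one of the most straightforward ways to make concurrent blocking calls. You can also combine multiprocessing and threads (afaik, but please correct me if I'm wrong), or combine multiprocessing with green threads.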

Answer 7

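1) Are you opening the same site many times, or many different sites? If many different sites, I think urllib2 is good. If you are hitting the same site over and over again, I have had some personal luck with urllib3 http://code.google.com/p/urllib3/

2) BeautifulSoup is easy to use, but is pretty slow. If you do have to use it, make sure to decompose your tags to get rid of memory leaks, or else it will likely lead to memory issues (it did for me).

What do your memory and CPU usage look like? If you are maxing out your CPU, make sure you are using real heavyweight threads, so you can run on more than 1 core.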
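
On point 1, the gain when hitting one site repeatedly comes from reusing a single connection instead of reconnecting for every request. urllib3 handles that internally; the sketch below shows the idea with only the standard library's http.client in Python 3, so it is a substitute for illustration and not urllib3's API. `fetch_paths` and its parameters are hypothetical names.

```python
import http.client


def fetch_paths(host, port, paths):
    """Fetch several paths from one host over a single HTTP connection.

    The connection is reused for each request as long as the server
    keeps it alive; http.client reconnects on its own if it does not.
    """
    conn = http.client.HTTPConnection(host, port, timeout=10)
    bodies = []
    try:
        for path in paths:
            conn.request("GET", path)
            response = conn.getresponse()
            # The body must be read completely before the next request is sent.
            bodies.append(response.read())
    finally:
        conn.close()
    return bodies
```

A call might look like fetch_paths("example.com", 80, ["/", "/about"]), where the host and paths are placeholders.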

Answer 8

How about using pycurl?

You can install it with apt-get:

$ sudo apt-get install python-pycurl

Answer 9

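First, you should try the multithreading/multiprocessing packages. Currently, the three popular ones are multiprocessing, concurrent.futures, and threading. These packages can help you open multiple URLs at the same time, which can increase the speed.

More importantly, once you are using multithreaded processing, if you try to open hundreds of URLs at the same time, you will find that urllib.request.urlopen is very slow, and opening and reading the content becomes the most time-consuming part. So if you want to make it even faster, you should try the requests package: requests.get(url).content is faster than urllib.request.urlopen(url).read().

So here I list two examples of fast multi-URL parsing, which are faster than the other answers. The first example uses the classic threading package and spawns hundreds of threads at the same time. (One trivial shortcoming is that it cannot keep the original order of the tickers.)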

import time
import threading
import pandas as pd
import requests
from bs4 import BeautifulSoup


ticker = pd.ExcelFile('short_tickerlist.xlsx')
ticker_df = ticker.parse(str(ticker.sheet_names[0]))
ticker_list = list(ticker_df['Ticker'])

start = time.time()

result = []
def fetch(ticker):
    url = ('http://finance.yahoo.com/quote/' + ticker)
    print('Visit ' + url)
    text = requests.get(url).content
    soup = BeautifulSoup(text,'lxml')
    result.append([ticker,soup])
    print(url +' fetching...... ' + str(time.time()-start))



if __name__ == '__main__':
    process = [None] * len(ticker_list)
    for i in range(len(ticker_list)):
        process[i] = threading.Thread(target=fetch, args=[ticker_list[i]])

    for i in range(len(ticker_list)):    
        print('Start_' + str(i))
        process[i].start()



    # Wait for every thread to finish so the elapsed time covers the downloads
    for i in range(len(ticker_list)):
        print('Join_' + str(i))
        process[i].join()

    print("Elapsed Time: %ss" % (time.time() - start))
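
One way around the ordering shortcoming, sketched in Python 3 under the assumption that each thread is told its position in the input list: every thread writes into its own slot of a pre-sized list, so the results line up with the inputs regardless of which download finishes first. `fetch_in_order` and `fake_fetch` are hypothetical names, and `fake_fetch` only sleeps instead of calling requests.get.

```python
import threading
import time


def fake_fetch(ticker):
    """Hypothetical stand-in for a download; longer ticker names finish sooner."""
    time.sleep(0.3 / (1 + len(ticker)))
    return "page for " + ticker


def fetch_in_order(items, fetch):
    """Fetch every item in its own thread; results[i] belongs to items[i]."""
    results = [None] * len(items)

    def worker(index, item):
        # Each thread writes only to its own slot, so no lock is needed.
        results[index] = fetch(item)

    threads = [threading.Thread(target=worker, args=(i, item))
               for i, item in enumerate(items)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every slot has been filled
    return results


if __name__ == "__main__":
    tickers = ["A", "BB", "CCC"]
    print(fetch_in_order(tickers, fake_fetch))
```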

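The second example uses the multiprocessing package, and it is a little more straightforward, since you only need to state the number of processes in the pool and map the function. The order does not change after the content is fetched, and the speed is similar to the first example but much faster than the other methods.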

from multiprocessing import Pool
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os
import time

os.chdir('file_path')

start = time.time()

def fetch_url(x):
    print('Getting Data')
    myurl = ("http://finance.yahoo.com/q/cp?s=%s" % x)
    html = requests.get(myurl).content
    soup = BeautifulSoup(html,'lxml')
    out = str(soup)
    listOut = [x, out]
    return listOut

tickDF = pd.read_excel('short_tickerlist.xlsx')
li = tickDF['Ticker'].tolist()    

if __name__ == '__main__':
    p = Pool(5)
    output = p.map(fetch_url, li, chunksize=30)
    print("Time is %ss" %(time.time()-start))

Python urllib2.urlopen() is slow, need a better way to read several urls

9 answers: