Question

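I have a Python 3 script that uses the urllib.request and BeautifulSoup libraries to load website content and export information from it to a CSV file or a MySQL database. Here is the main part of the script: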

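import csv
import urllib.request

from bs4 import BeautifulSoup

# ...

# Fetch the page; the with-block closes the connection even if read() fails
with urllib.request.urlopen("<urls here>") as url:
    html = url.read()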
soup = BeautifulSoup(html, "html.parser")
# Create lists for html elements
nadpis = soup.find_all("span", class_="nadpis")     
# Some more soups here...

# Number of elements on the page; "no" is presumably one of the lists built in the elided code above
onpage = len(no)
for i in range(onpage):
    nadpis[i] = one_column(nadpis[i].string)
    # Some more soups here

if csv_export:
    # In Python 3 the csv module needs a text-mode file opened with newline=""
    with open("export/" + category[c][0] + ".csv", "a", newline="") as csv_file:
        wr = csv.writer(csv_file, delimiter=';', quotechar='|', quoting=csv.QUOTE_MINIMAL, lineterminator='\n')
        wr.writerow("<informations from soup>")

# Insert to database
if db_insert:
    try:        
        cursor.execute("<informations from soup>")
        conn.commit()
    except Exception:
        print("Some MySQL error...")
        break

# ...

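The complete script is about 200 lines of code, so I won't spam it here. Everything works fine. The problem is that I need to scan and export information from tons of web pages (everything runs inside a while loop, but that isn't relevant here), and it gets really slow: the run time is measured in hours.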

Is there a faster way to do this?

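I implemented multiprocessing so that I can use every CPU core, but even so it may take 24 hours to export everything. I even tested it on an Amazon EC2 server, but it was not any faster there, so the problem is not a slow PC or a slow internet connection.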

Answer 1

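If you have a performance problem, I suggest you start by profiling your code. That will give you details on where your code spends most of its run time. You can also measure how long your script takes to scrape each web page. Maybe you'll find that some pages take much longer to load than others, which would suggest that you are limited not by your bandwidth but by the server you are trying to access.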
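Here is a minimal sketch of both measurements using only the standard library. The helper names `time_pages` and `profile_call` are made up for illustration, and `fetch` stands for whatever function in your script downloads and parses a single page.

```python
import cProfile
import pstats
import time


def time_pages(urls, fetch):
    """Time fetch(url) for every URL and return (url, seconds), slowest first."""
    timings = []
    for url in urls:
        start = time.perf_counter()
        fetch(url)
        timings.append((url, time.perf_counter() - start))
    return sorted(timings, key=lambda pair: pair[1], reverse=True)


def profile_call(func, *args, top=10):
    """Run func(*args) under cProfile and print the `top` costliest entries."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stats = pstats.Stats(profiler)
    stats.sort_stats("cumulative").print_stats(top)
    return result
```

If the profile shows most of the cumulative time inside `urlopen` or socket reads rather than in BeautifulSoup, the script is waiting on the network or the remote server, and adding more CPU cores will not help much.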

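However, what do you call 'tons of web pages'? If your script is reasonably optimized and you are already using all CPU cores, it may simply be that you have too many pages to scrape for it to finish as fast as you would like (by the way, how fast do you want it to be?).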

Answer 2

I would recommend simple-requests, for example:

from bs4 import BeautifulSoup
from simple_requests import Requests

# Creates a session and thread pool
requests = Requests(concurrent=2, minSecondsBetweenRequests=0.15)

# Pages to scrape (placeholder URLs)
urls = [
    'http://www.url1.com/',
    'http://www.url2.com/',
    'http://www.url3.com/' ]

# Asynchronously send all the requests and parse each response
for url_response in requests.swarm(urls):
    html = url_response.html
    soup = BeautifulSoup(html, "html.parser")

    # Some more soups here...

    # Write out your file...
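If you would rather not add a dependency, the same idea (several downloads in flight at once) can be sketched with the standard library's thread pool. This is a different technique from simple-requests, not its API: `fetch` and `fetch_all` are hypothetical helpers, and `max_workers=8` is an arbitrary starting value that you should tune to what the target server tolerates.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request


def fetch(url, timeout=30):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch_one=fetch, max_workers=8):
    """Fetch pages concurrently; map each URL to its bytes or to the exception raised."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the downloads overlap
        futures = {url: pool.submit(fetch_one, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:
                # Keep going; one failed page should not abort the whole run
                results[url] = exc
    return results
```

Threads fit here because the work is mostly waiting on the network. You can then parse each downloaded page with BeautifulSoup(html, "html.parser") exactly as in the original script.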

Exporting information from a large number of web pages
