Question

I am trying to download all the PGNs from this site.

I think I have to use urlopen to open each URL and then use urlretrieve to download each PGN, reaching it through the download button near the bottom of each game's page. Do I have to create a new BeautifulSoup object for each game? I am also not sure how urlretrieve works.


Answer 1

There is no short answer to your question. I will show you a complete solution and comment on the code.

First, import the necessary modules:

from bs4 import BeautifulSoup
import requests
import re

Next, fetch the index page and create a BeautifulSoup object:

req = requests.get("http://www.chessgames.com/perl/chesscollection?cid=1014492")
soup = BeautifulSoup(req.text, "lxml")

I strongly recommend the lxml parser over the common html.parser. After that, prepare a list of links to the games:

pages = soup.find_all('a', href=re.compile(r'.*chessgame\?.*'))

You can do this by searching for links whose href contains the word 'chessgame'. Now prepare a function that will download the files for you:

def download_file(url):
    path = url.split('/')[-1].split('?')[0]
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        with open(path, 'wb') as f:
            for chunk in r:
                f.write(chunk)
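
To see which local filename download_file ends up using, the path expression can be tried on its own. This is only a minimal sketch: the URL below is made up for illustration, and the helper name filename_from_url is not part of the answer.

```python
def filename_from_url(url):
    """Mirror the path expression used in download_file."""
    # Take the last path segment, then drop any query string.
    return url.split('/')[-1].split('?')[0]


# Hypothetical download URL, used only to illustrate the string handling.
example_url = "http://www.chessgames.com/pgn/example_game.pgn?gid=12345"
print(filename_from_url(example_url))
```

If the query string itself contained a slash, this expression would pick the wrong segment; urllib.parse.urlsplit would be a more robust choice in that case.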

The final piece of magic is to repeat the previous steps for each game page, preparing the links for the file downloader:

host = 'http://www.chessgames.com'
for page in pages:
    url = host + page.get('href')
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "lxml")
    file_link = soup.find('a',text=re.compile('.*download.*'))
    file_url = host + file_link.get('href')
    download_file(file_url)

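(First search for the link whose text contains 'download', then build the full URL by concatenating the host name and the path, and finally download the file.)

I hope you can use this code without corrections!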
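
As an alternative to plain string concatenation, the standard library's urljoin can build the full URL. This sketch is a swapped-in technique, not what the answer's code does, and both href values are hypothetical:

```python
from urllib.parse import urljoin

host = "http://www.chessgames.com"

# Hypothetical href values of the kind an <a> tag might carry.
relative_href = "/perl/chessgame?gid=12345"
absolute_href = "http://www.chessgames.com/pgn/example_game.pgn?gid=12345"

# A relative href is resolved against the host.
full_from_relative = urljoin(host, relative_href)
# An href that is already absolute is returned unchanged.
full_from_absolute = urljoin(host, absolute_href)

print(full_from_relative)
print(full_from_absolute)
```

With host + href, an href that was already absolute would yield a malformed URL, which is the main reason to consider urljoin here.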

Answer 2

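The accepted answer is great, but the task is embarrassingly parallel; there is no need to retrieve these subpages and files one at a time. This answer shows how to speed things up.

The first step is to use requests.Session() when sending multiple requests to a single host. Quoting Advanced Usage: Session Objects from the requests documentation: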

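"The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance, and will use urllib3's connection pooling. So if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase (see HTTP persistent connection)."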

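Next, asyncio, multiprocessing, or multithreading can be used to parallelize the workload. Each has trade-offs for the task at hand, and which one you choose is probably best determined by benchmarking and profiling. This page offers good examples of all three.

For the purposes of this post, I'll show multithreading. The GIL shouldn't be much of a bottleneck, because the tasks are mostly IO-bound: most of the time is spent waiting on in-flight requests for their responses. When a thread is blocked on IO, it can yield to a thread that is parsing HTML or doing other CPU-bound work.

Here's the code: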

import os
import re
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

def download_pgn(task):
    session, url, destination_path = task
    response = session.get(url)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "lxml")
    game_url = host + soup.find("a", text="download").get("href")
    filename = re.search(r"\w+\.pgn", game_url).group()
    path = os.path.join(destination_path, filename)
    response = session.get(game_url, stream=True)
    response.raise_for_status()

    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)

if __name__ == "__main__":
    host = "http://www.chessgames.com"
    url_to_scrape = host + "/perl/chesscollection?cid=1014492"
    destination_path = "pgns"
    max_workers = 8

    if not os.path.exists(destination_path):
        os.makedirs(destination_path)
    
    with requests.Session() as session:
        response = session.get(url_to_scrape)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "lxml")
        pages = soup.find_all("a", href=re.compile(r".*chessgame\?.*"))
        tasks = [
            (session, host + page.get("href"), destination_path) 
            for page in pages
        ]

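        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # Consume the results so that exceptions raised in workers
            # (for example by raise_for_status) are re-raised here.
            list(pool.map(download_pgn, tasks))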

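I used response.iter_content here, which is unnecessary for such small text files, but it generalizes the code so that larger files are handled in a memory-friendly way.

Results of a rough benchmark (the first request is the bottleneck):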
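
The idea behind chunked writing can be illustrated without the network. The sketch below is a stand-in using in-memory streams, not requests itself; copy_in_chunks is a hypothetical helper, and the 5000-byte payload is arbitrary:

```python
import io


def copy_in_chunks(source, destination, chunk_size=1024):
    """Copy a binary stream piece by piece; return the number of chunks written."""
    chunks_written = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty bytes means the stream is exhausted
            break
        destination.write(chunk)
        chunks_written += 1
    return chunks_written


source = io.BytesIO(b"x" * 5000)  # stand-in for a response body
destination = io.BytesIO()        # stand-in for the open output file
count = copy_in_chunks(source, destination)
print(count, len(destination.getvalue()))
```

Only one chunk is held in memory at a time, so the same loop would work for a file far larger than available RAM.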

max workers	session?	seconds
1	no	126
1	yes	111
8	no	24
8	yes	22
32	yes	16
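
The effect of the worker count can be reproduced in miniature without touching the site. In this sketch, fake_download and timed_run are hypothetical names, and time.sleep stands in for waiting on a response, so the timings only illustrate the trend and are not comparable to the figures above:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(delay):
    """Stand-in for a network request: just wait, then report the delay."""
    time.sleep(delay)
    return delay


def timed_run(max_workers, delays):
    """Run all fake downloads on a thread pool and measure the wall time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fake_download, delays))
    return results, time.perf_counter() - start


delays = [0.1] * 8
results_one, one_worker = timed_run(1, delays)
results_eight, eight_workers = timed_run(8, delays)
print(f"1 worker: {one_worker:.2f}s, 8 workers: {eight_workers:.2f}s")
```

Because each task spends its time blocked rather than computing, the eight-worker run should finish in roughly the time of a single wait.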

Python download multiple files
