Downloading Word documents with Python

Date: 2018-12-06 10:52:53

Tags: python web-scraping download

For my coursework I have to build a web scraper that finds the img, Word doc, and PDF links on a website and downloads them to files. I got the img download working, but when I change the code to download docs or PDFs it doesn't find anything at all. I scraped the site with BeautifulSoup, and I know there are docs and PDFs on the site; they just won't download.

from bs4 import BeautifulSoup
import urllib.request
import shutil
import requests
from urllib.parse import urljoin
import sys
import time
import os
import hashlib
import re

url = 'http://www.soc.napier.ac.uk/~40009856/CW/'

path = 'c:\\temp\\'

def ensure_dir(path):
    # create the download directory if it does not already exist
    if not os.path.exists(path):
        os.makedirs(path)
    return path

os.chdir(ensure_dir(path))

def make_soup(url):
    # fetch the page and parse it with BeautifulSoup
    response = requests.get(url)
    return BeautifulSoup(response.content, 'html.parser')

def get_docs(url):
    soup = make_soup(url)
    documents = [docs for docs in soup.findAll('doc')]
    print(str(len(documents)) + " documents found.")
    print('Downloading documents to current working directory.')
    documents_links = [each.get('src') for each in documents]
    for each in documents_links:
        try:
            filename = each.strip().split('/')[-1].strip()
            src = urljoin(url, each)
            print('Getting: ' + filename)
            response = requests.get(src, stream=True)
            # delay to avoid corrupted previews
            time.sleep(1)
            with open(filename, 'wb') as out_file:
                shutil.copyfileobj(response.raw, out_file)
        except Exception:
            print('  An error occurred. Continuing.')
    print('Done.')

if __name__ == '__main__':
    get_docs(url)

2 answers:

Answer 0 (score: 0)

First, you should read up on what .find_all() and the other search methods actually do: .find_all()

The first argument of .find_all() is the tag name.

An image is an

<img src='some_url'>

tag. With soup.find_all('img') you got all the img tags, extracted the URLs to the actual files, and downloaded them.

Now you are instead looking for tags like

<a href='some_url'></a>

whose URL contains ".doc". Something like this should do it:

soup.select('a[href*=".doc"]')
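
Putting that together, a minimal sketch of a fixed get_docs (assuming the imports and the make_soup helper from the question) could look like this; note that it reads href rather than src:

def get_docs(url):
    soup = make_soup(url)
    # anchor tags whose href contains ".doc", not non-existent <doc> tags
    documents = soup.select('a[href*=".doc"]')
    print(str(len(documents)) + ' documents found.')
    for each in documents:
        href = each.get('href')        # the link lives in href, not src
        filename = href.strip().split('/')[-1]
        src = urljoin(url, href)       # complete the relative path
        print('Getting: ' + filename)
        response = requests.get(src, stream=True)
        with open(filename, 'wb') as out_file:
            shutil.copyfileobj(response.raw, out_file)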

Answer 1 (score: 0)

Slightly more involved, but you can combine pdf, docx etc. with the OR (comma) CSS selector syntax. Note that you still need to complete some of the paths, e.g. by prefixing "http://www.soc.napier.ac.uk/~40009856/CW/". The following uses the attribute = value CSS selector syntax with the $ operator (meaning the attribute value ends with the given string):

from bs4 import BeautifulSoup
import requests

url = 'http://www.soc.napier.ac.uk/~40009856/CW/'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'lxml')
# $= matches attribute values ending with the given string;
# the comma combines the selectors (OR)
items = soup.select("[href$='.docx'], [href$='.pdf'], img[src]")
print([item['href'] if 'href' in item.attrs else item['src'] for item in items])
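
To go from matched elements to files on disk, a possible follow-up (my sketch, not part of the original answer) joins each relative link against the base URL and streams it to a local file:

from urllib.parse import urljoin

for item in items:
    link = item['href'] if 'href' in item.attrs else item['src']
    full_url = urljoin(url, link)      # prefix relative links with the base URL
    filename = link.split('/')[-1]
    response = requests.get(full_url, stream=True)
    with open(filename, 'wb') as out_file:
        for chunk in response.iter_content(chunk_size=8192):
            out_file.write(chunk)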