Python BS4: Ignoring the parent directory

Time: 2020-04-16 18:52:28

Tags: python web-scraping

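I'm trying to figure out what I'm doing wrong, but I can't seem to find it.

I'm trying to write a very simple scraper, but I'm stuck on one thing: I want to execute continue inside the for loop when a given condition is met. Basically, when link.text equals Parent Directory, the loop should move on to the next link without creating a folder. The exclude check is on line 48. As it stands, the script always creates a "Parent Directory" folder.

I'm testing against this URL: http://artscene.textfiles.com/music/mods/MODS/MODLAND/Maktone/ If you open the site, the "Parent Directory" link at the top of the listing is the one I'm trying to make the script ignore.

What am I doing wrong?

Thanks in advance.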

# Import what's needed
from bs4 import BeautifulSoup
import os
import requests
from pathlib import Path

# Our base url where files and sub-directories are located
# Script will download every file with the given extension(s)
# and if you want it to go through sub-directories it also will
# do that
urlInput = input('Enter a URL to scrape: ')

# Download directory
dirInput = input('Enter directory name: ')

# Our download directory where all the files will be stored
# This is basically the script location and the download directory within
downloadDir = os.getcwd() + '\\' + 'download' + '\\' + dirInput + '\\'

# Extensions and exclude list
# Extensions are files we want to download
# Exclude's are link text's we want to ignore
exts = ['.mod', '.xm', '.it']
excludes = ['parent directory']

# Download given URL
# to given destination directory
def run(url, dest):

    # Grab page
    request = requests.get(url)

    # Create destination directory
    Path(dest).mkdir(parents = True, exist_ok = True)

    # Parse the HTML data
    html = request.text
    soup = BeautifulSoup(html)

    # Go through every link
    for link in soup.find_all('a'):

        # Grab href and link text (for naming purpose)
        href = link.get('href')
        text = link.text

        # Do some excludes
        if text.lower in excludes:
            continue

        # Grab the file extension from the URL
        hrefExt = os.path.splitext(href)[1]

        # Check that the file in this iteration is
        # in the extensions list
        if hrefExt in exts:
            file = requests.get(url + href)
            open(dest + text, 'wb').write(file.content)
            print('Downloaded: ' + url+text)

        elif (href.endswith('/')):
            run(url + href, dest + text + '\\')

run(urlInput, downloadDir)

1 answer:

Answer 0: (score: 0)

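Okay, never mind. I forgot the parentheses on lower. Without them, text.lower is the method object itself rather than the lowercased string, so the check text.lower in excludes is never true and continue never runs. Changing it to text.lower() solved it.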
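
For anyone landing here with the same symptom, here is a minimal sketch of the difference. It reuses the names excludes and text from the script above; the sample link text and the should_skip helper are assumptions added for illustration, not part of the original script. The strip() call is also an extra precaution, in case a listing pads its link text with whitespace.

```python
# Names mirror the question's script; the sample link text is an assumption.
excludes = ['parent directory']
text = 'Parent Directory'

# Bug: text.lower is the bound method object itself, not a string,
# so it can never equal any string in the list.
buggy_match = text.lower in excludes

# Fix: call the method so the lowercased string is compared.
fixed_match = text.lower() in excludes


def should_skip(link_text, excluded=excludes):
    """Hypothetical helper: True when the link text is in the exclude list."""
    # strip() is an added precaution against padded link text.
    return link_text.strip().lower() in excluded


print(buggy_match)  # False
print(fixed_match)  # True
```

In the loop from the question, the check would then read `if should_skip(text): continue`, or simply `if text.lower() in excludes: continue`.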