I'm trying to figure out what I'm doing wrong, but I can't seem to find it.
I'm building a very simple scraper, but the problem I'm running into is that I want to execute a continue inside the for loop when a given condition is met.
Basically, I want the for loop to skip creating a folder whenever link.text
equals Parent Directory.
The exclude check is on line 48, but it always creates the "Parent Directory" folder anyway.
I'm testing with http://artscene.textfiles.com/music/mods/MODS/MODLAND/Maktone/ - if you open that page, the "Parent Directory" link at the top is the one I'm trying to ignore in my script...
What am I doing wrong?
Thanks in advance
# Import what's needed
from bs4 import BeautifulSoup
import os
import requests
from pathlib import Path

# Our base url where files and sub-directories are located
# Script will download every file with the given extension(s)
# and if you want it to go through sub-directories it also will
# do that
urlInput = input('Enter a URL to scrape: ')

# Download directory
dirInput = input('Enter directory name: ')

# Our download directory where all the files will be stored
# This is basically the script location and the download directory within
downloadDir = os.getcwd() + '\\' + 'download' + '\\' + dirInput + '\\'

# Extensions and exclude list
# Extensions are files we want to download
# Excludes are link texts we want to ignore
exts = ['.mod', '.xm', '.it']
excludes = ['parent directory']

# Download given URL
# to given destination directory
def run(url, dest):
    # Grab page
    request = requests.get(url)

    # Create destination directory
    Path(dest).mkdir(parents=True, exist_ok=True)

    # Parse the HTML data
    html = request.text
    soup = BeautifulSoup(html)

    # Go through every link
    for link in soup.find_all('a'):
        # Grab href and link text (for naming purposes)
        href = link.get('href')
        text = link.text

        # Do some excludes
        if text.lower in excludes:
            continue

        # Grab the file extension from the URL
        hrefExt = os.path.splitext(href)[1]

        # Check that the file in this iteration is
        # in the extensions list
        if hrefExt in exts:
            file = requests.get(url + href)
            open(dest + text, 'wb').write(file.content)
            print('Downloaded: ' + url + text)
        elif href.endswith('/'):
            run(url + href, dest + text + '\\')

run(urlInput, downloadDir)
Answer 0 (score: 0)

Well, never mind.. I forgot the parentheses on lower()
... Solved.
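For anyone who hits the same thing, here is a minimal sketch of why the exclude check silently failed: without parentheses, text.lower is the bound method object itself, not the lowercased string, so the membership test against a list of strings is always False.

```python
excludes = ['parent directory']
text = 'Parent Directory'

# Without parentheses: text.lower is a bound method object,
# so comparing it against a list of strings never matches.
print(text.lower in excludes)    # False

# With parentheses: the method is called and returns the
# lowercased string, which does match the exclude list.
print(text.lower() in excludes)  # True
```

This is easy to miss because Python raises no error - a bound method is a perfectly valid value to test for membership, it just never equals a string.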