Web scraping - Detective Conan

Posted: 2020-03-18 07:44:49

Tags: python web-scraping

I am trying to download all episodes of Detective Conan from https://www.kiss-anime.ws/ (kudos to them). I ran into a problem while scraping the download URLs from the site.

Say I want to download the first episode of Detective Conan, so I scrape the download URL from this page ( https://www.kiss-anime.ws/Anime-detective-conan-1 ). When I try to fetch the site's HTML in order to extract the download URL, using the following code:

from urllib.request import Request, urlopen

req = Request('https://www.kiss-anime.ws/Anime-detective-conan-1', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

I get the following error:

Traceback (most recent call last):
  File "refine.py", line 41, in <module>
    webpage = urlopen(req).read()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Service Temporarily Unavailable
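Before reaching for a full browser, it can be worth checking whether the 503 comes from a simple bot filter: a fuller browser-like header set sometimes gets past naive checks, though it will not defeat a JavaScript challenge (e.g. Cloudflare), which is why the answer resorts to Selenium. The header values in this sketch are illustrative, and the site may still refuse the request:

```python
from urllib.request import Request

# Illustrative browser-like headers; values are assumptions, not guaranteed
# to satisfy this particular site's bot filter.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.kiss-anime.ws/",
}

def build_request(url: str) -> Request:
    """Wrap a URL in a Request that carries browser-like headers."""
    return Request(url, headers=BROWSER_HEADERS)

req = build_request("https://www.kiss-anime.ws/Anime-detective-conan-1")
print(req.get_header("User-agent"))  # urllib stores header names capitalized
```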

Since there are 900+ episodes, I don't want to visit every link and click the download button manually. Once I have a link, I download the episode with the following code (in case anyone is wondering how I'd do it):

import webbrowser
webbrowser.open("https://www.kiss-anime.ws/download.php?id=VkE3eFcvTlpZb0RvKzJ0Tmx2V2ROa3J4UWJ1U09Ic0VValh1WGNtY2Fvbz0=&key=B2X2kHBdIGdzAxn4kHmhXDq0XNq5XNu1WtujWq==&ts=1584489495")
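webbrowser.open only hands the URL to the default browser. As a sketch (assuming the download.php URL, with its id/key/ts token, serves the file directly and accepts a plain User-Agent header; save_episode is a hypothetical helper), the file could instead be streamed straight to disk:

```python
import shutil
from urllib.request import Request, urlopen

def save_episode(download_url: str, out_path: str,
                 user_agent: str = "Mozilla/5.0") -> None:
    """Stream a download URL to a local file instead of opening a browser tab.

    Assumes the URL responds with the raw video bytes; copyfileobj streams
    in chunks, so large episodes do not have to fit in memory.
    """
    req = Request(download_url, headers={"User-Agent": user_agent})
    with urlopen(req) as resp, open(out_path, "wb") as fh:
        shutil.copyfileobj(resp, fh)
```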

Any help would be greatly appreciated, thanks!

1 Answer:

Answer 0 (score: 0)

So apparently there are 808 episodes. Take a look at this code; there is a lot going on here, but it is easy to follow. I tested the download on about 5-6 episodes and it works...

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from http.client import RemoteDisconnected
import time


def get_browser():
    # Launch an incognito Chrome instance with extensions and notifications disabled.
    chrome_options = Options()
    chrome_options.add_argument("--disable-extensions")
    chrome_options.add_argument('--disable-notifications')
    chrome_options.add_argument('--incognito')
    driver = webdriver.Chrome(options=chrome_options)
    return driver


driver = get_browser()
page_url = "https://www.kiss-anime.ws/Anime-detective-conan-1"

try:
    driver.set_page_load_timeout(40)
    driver.get(page_url)
except TimeoutException:
    raise Exception(f"\t{page_url} - Timed out receiving message from renderer")
except RemoteDisconnected:
    raise Exception(f"\tError 404: {page_url} not found.")

# Wait for the episode drop-down to render, open it, then parse the page source.
WebDriverWait(driver, timeout=40).until(EC.presence_of_element_located((By.ID, "selectEpisode")))
driver.find_element_by_id("selectEpisode").click()
soup = BeautifulSoup(driver.page_source, "html.parser")

options = soup.find("select", attrs={"id": "selectEpisode"}).find_all("option")
print(f"Found {len(options)} episodes...")


base_url = "https://www.kiss-anime.ws/"
# Visit every episode URL taken from the drop-down options.
for idx, option in enumerate(options):
    print(f"Downloading {idx+1} of {len(options)}...")
    page_url = option['value']

    try:
        driver.set_page_load_timeout(40)
        driver.get(page_url)
    except TimeoutException:
        print(f"\t{page_url} - Timed out receiving message from renderer")
        continue
    except RemoteDisconnected:
        print(f"\tError 404: {page_url} not found.")
        continue

    # Wait for the download button, click it, and give the download time to finish.
    WebDriverWait(driver, timeout=40).until(EC.presence_of_element_located((By.ID, "divDownload")))
    driver.find_element_by_id("divDownload").click()
    print(f"\t Downloading...")
    time.sleep(15)


driver.quit()
print("done")

So first I open the URL in a Chrome browser and read the drop-down values, 808 in total. Then I loop over those 808 URLs to fetch the actual link needed to download each video. By calling .click() inside the loop I am effectively simulating a press of the download button, after which the video starts downloading. Remember to adjust time.sleep(x), where x is the number of seconds one episode takes to download on your connection.
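A fixed time.sleep(x) either wastes time or cuts downloads short. One possible refinement (assuming Chrome's default behavior of writing in-progress downloads as *.crdownload files in the download directory; wait_for_chrome_downloads is a hypothetical helper) is to poll for completion instead:

```python
import os
import time

def wait_for_chrome_downloads(download_dir: str, timeout: float = 600,
                              poll: float = 1.0) -> bool:
    """Block until no Chrome partial-download (*.crdownload) files remain.

    Assumption: Chrome writes an in-progress download as <name>.crdownload
    and renames it when finished. Returns True on completion, False if the
    timeout expires first.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        partials = [f for f in os.listdir(download_dir)
                    if f.endswith(".crdownload")]
        if not partials:
            return True
        time.sleep(poll)
    return False
```

In the loop above, the time.sleep(15) line would be replaced by a call such as wait_for_chrome_downloads("/path/to/Downloads").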

You need to pip install the bs4 and selenium packages. Also, download chromedriver.exe and make sure it sits on the same path as this script.