Question

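I'm new to data science. Recently I tried to fetch some PDF files from a website by using Python's Beautiful Soup to walk through all of the download-link buttons. However, whichever method I use, I get 403 Forbidden. I don't know whether this is because the site has some kind of hidden authentication that is more advanced than my beginner skills can handle. Below are the details of everything I have tried so far.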

Website: https://thescriptlab.com/screenplays/#wpfb-cat-2

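This page cannot be reached from the drop-down menus on https://thescriptlab.com. I don't know whether that is intentional.
Each item can be clicked and redirects to a download page.
When I click the download button for the first time, it asks for account information. After creating an account and logging in, I could not log out again, which is odd.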

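When I compared the first item's link with its download link, they looked very similar, so I thought this would be easy. Reality showed me I was too young and naive. Here are the links.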

Item link: https://thescriptlab.com/script-library/1_1x01_-_pilot-pdf
Download link: https://thescriptlab.com/download/screenplays/1_1x01_-_Pilot.pdf

Here is the code I wrote:

import requests
from bs4 import BeautifulSoup
import pandas as pd

user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.37"
url = 'https://thescriptlab.com/screenplays/#wpfb-cat-2'
r = requests.get(url, headers={'User-Agent':user_agent})
soup = BeautifulSoup(r.content, 'html.parser')

# Skip the first two matching spans; the remaining ones are expected to contain an item link.
screenplay_html = soup.find_all('span', class_="")[2:]

# Collect the item-page URL from the <a> tag inside each span.
screenplay_url_lst = []
for span in screenplay_html:
    screenplay_url_lst.append(span.find('a').get('href'))

# Build each download URL from the item URL's slug (characters 40 to -5).
screenplay_download_lst = []
for item_url in screenplay_url_lst:
    download_link = f'https://thescriptlab.com/download/screenplays/{item_url[40:-5]}.pdf'
    screenplay_download_lst.append(download_link)
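
The fixed slice `[40:-5]` depends on the exact length of the URL prefix and on the item URL ending in `-pdf/`. As a slightly more robust sketch, assuming item URLs always end in a `-pdf` slug like the example above, the path can be parsed instead. The helper name `item_to_download_url` and the constant `DOWNLOAD_PREFIX` are hypothetical, and this cannot recover differences in letter case between the two links (`pilot` vs. `Pilot`).

```python
from urllib.parse import urlparse

# Prefix taken from the example download link in the question.
DOWNLOAD_PREFIX = "https://thescriptlab.com/download/screenplays/"

def item_to_download_url(item_url: str) -> str:
    """Hypothetical helper: map an item-page URL to a guessed download URL."""
    # Take the last non-empty path segment, e.g. "1_1x01_-_pilot-pdf".
    slug = urlparse(item_url).path.rstrip("/").split("/")[-1]
    # Drop the trailing "-pdf" marker if it is present.
    if slug.endswith("-pdf"):
        slug = slug[: -len("-pdf")]
    return f"{DOWNLOAD_PREFIX}{slug}.pdf"

print(item_to_download_url("https://thescriptlab.com/script-library/1_1x01_-_pilot-pdf"))
```

This works with or without a trailing slash on the item URL, which the fixed slice does not.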

Finally, I built a DataFrame of all the download links, shown below. [download-link DataFrame][1]

[1]: https://i.stack.imgur.com/J1JVs.png

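When I click an individual link in the browser, it works if I am already logged in. If not, it shows the one-time account registration page. However, whether I use urlretrieve or requests.get, I get 403 Forbidden.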

Here is the code I used:

import time
import urllib.request

user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.37"
url = 'https://thescriptlab.com/screenplays/#wpfb-cat-2'

# download_list is the DataFrame of titles and download links built above.
for index, link in download_list.iterrows():
    my_session = requests.session()
    for_cookies = my_session.get(url)
    cookies = for_cookies.cookies
    my_url = link['download']
    my_title = link['title']

    response = my_session.get(my_url, headers={'User-Agent':user_agent}, cookies=cookies)
    print(response.status_code)  # 200

    urllib.request.urlretrieve(my_url, my_title)
    time.sleep(60)
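
One thing worth noting about the loop above: `urllib.request.urlretrieve` opens its own connection, so it sends neither the cookies held by `my_session` nor the custom User-Agent. A minimal sketch of keeping the login and the download on one `requests.Session` follows. The login URL and the form field names (`wp-login.php`, `log`, `pwd`) are assumptions based on a default WordPress login form and are not confirmed for this site; the function names are hypothetical as well.

```python
import requests

# Assumption: a default WordPress login endpoint; the real site may differ.
LOGIN_URL = "https://thescriptlab.com/wp-login.php"
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64)"

def make_logged_in_session(username, password, session=None):
    """Post credentials once so later requests carry the login cookies."""
    if session is None:
        session = requests.Session()
    session.headers.update({"User-Agent": USER_AGENT})
    # Assumption: "log" and "pwd" are the form field names (WordPress default).
    resp = session.post(LOGIN_URL, data={"log": username, "pwd": password})
    resp.raise_for_status()
    return session

def download(session, url, path):
    """Stream one file to disk using the same session (and its cookies)."""
    with session.get(url, stream=True) as resp:
        # Raise on 403 and similar instead of saving an error page to disk.
        resp.raise_for_status()
        with open(path, "wb") as outfile:
            for chunk in resp.iter_content(chunk_size=8192):
                outfile.write(chunk)
```

Under these assumptions, usage would be one call to `make_logged_in_session(...)` before the loop, then `download(session, my_url, my_title)` for each row, with the existing `time.sleep` kept between requests.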

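Then I decided to try downloading from just one link, and a file did download. However, it is not a complete file: it is only 146 bytes and is corrupted.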

test_url = 'https://thescriptlab.com/download/screenplays/1_1x01_-_pilot.pdf'
r = requests.get(test_url)
with open('$1 - Season 1 - Episode 1 - Pilot', 'wb') as outfile:
    outfile.write(r.content)
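
A 146-byte result is far too small for a screenplay and is probably an error or redirect page saved under a PDF name, though that is a guess. One quick check, sketched below, is to look at the first bytes of the response body, since PDF files begin with `%PDF-`. The helper name `looks_like_pdf` is hypothetical.

```python
def looks_like_pdf(content: bytes) -> bool:
    """Return True if the bytes begin with the PDF magic header."""
    return content.startswith(b"%PDF-")

# Example with in-memory data rather than a live request:
print(looks_like_pdf(b"%PDF-1.4\n..."))               # True
print(looks_like_pdf(b"<html>403 Forbidden</html>"))  # False
```

With requests, this could be applied as `looks_like_pdf(r.content)` before writing the file, alongside printing `r.status_code` and `r.headers.get('Content-Type')` to see what the server actually returned.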

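Can any web-scraping expert point me toward what I am missing? Or, to put the main question directly: how do I get past this hidden login/authentication page?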

How do I get past a hidden login page when web scraping with Python?

0 answers: