I am trying to scrape the URL below with Python requests, but the request does not succeed.
URL: https://support.oracle.com/rs?type=doc&id=1439822.1
Code that does not work:
import requests
from bs4 import BeautifulSoup
s = requests.session()
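# NOTE: `headers` was not defined in the original post; this User-Agent value is a placeholder assumption.
headers = {"User-Agent": "Mozilla/5.0"}
s.headers.update(headers)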
r = s.get("https://support.oracle.com/rs?type=doc&id=1439822.1", auth=('user@email.com', 'mypass'), allow_redirects=True)
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.prettify())
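Expected output: the same page I get in a web browser after logging in successfully. I need that output on the command line.
Note: I can do this with the wget command below, but I need it to work with Python requests.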
wget --user "user@email.com" --password "mypass" "https://support.oracle.com/rs?type=doc&id=1439822.1" -O /root/webout.html
Thanks for your help!
Answer 0: (score: 0)
Finally found the answer!
import requests
from bs4 import BeautifulSoup
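# First request: follow the redirects so that r.url holds the final URL.
r = requests.get("https://support.oracle.com/rs?type=doc&id=1439822.1", auth=('user@email.com', 'mypass'), allow_redirects=True)
# Second request: send the credentials again, this time directly to the final URL.
full_fetch = requests.get(r.url, auth=('user@email.com', 'mypass'), allow_redirects=True)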
soup = BeautifulSoup(full_fetch.text, 'html.parser')
print(soup.prettify())
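
A likely reason the second request is needed, which I have not verified against this particular site, is that requests drops the Authorization header when a redirect moves to a different host, so the credentials never reach the page you land on. The same two-step fetch can be wrapped in a small helper that reuses one session. This is only a sketch: `fetch_after_redirect` is a hypothetical name of mine, not part of requests, and it assumes the site accepts HTTP Basic credentials on the final URL.

```python
def fetch_after_redirect(session, url, auth):
    """Fetch `url`, then fetch the redirect target again with credentials.

    `session` is expected to behave like requests.Session: its get()
    accepts `auth` and `allow_redirects`, and the response exposes
    `.url`, `.text` and `.raise_for_status()`.
    """
    # Step 1: follow the redirect chain to learn the final URL.
    first = session.get(url, auth=auth, allow_redirects=True)

    # Step 2: request the final URL directly so the credentials are
    # sent to that host as well.
    final = session.get(first.url, auth=auth, allow_redirects=True)

    # Raise an exception on 4xx/5xx instead of returning an error page.
    final.raise_for_status()
    return final.text


# Usage (performs real network requests, so it is left commented out):
#
#   import requests
#   from bs4 import BeautifulSoup
#
#   with requests.Session() as s:
#       html = fetch_after_redirect(
#           s,
#           "https://support.oracle.com/rs?type=doc&id=1439822.1",
#           ("user@email.com", "mypass"),
#       )
#   print(BeautifulSoup(html, "html.parser").prettify())
```

Reusing one session also keeps any cookies set during the first request, which may matter if the site tracks the login that way.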