I am trying to scrape company reviews from Glassdoor using BeautifulSoup, but I cannot extract anything from the site. This is the code I am using:
from requests import get
from bs4 import BeautifulSoup
url = "https://www.glassdoor.com/Reviews/The-Wonderful-Company-Reviews-E1005987_P2.htm?sort.sortType=RD&sort.ascending=false"
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
html_soup
The code above extracts none of the review content. Instead, the response says "Bots not allowed". The output is shown below.
<!DOCTYPE html>
<html><head><title></title><style type="text/css">H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;} H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;} P {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A {color : black;}A.name {color : black;}.line {height: 1px; background-color: #525D76; border: none;}</style> </head><body><h1>HTTP Status 403 - Bots not allowed</h1><div class="line"></div><p><b>type</b> Status report</p><p><b>message</b> <u>Bots not allowed</u></p><p><b>description</b> <u>Access to the specified resource has been forbidden.</u></p><hr class="line"/><h3>Apache Tomcat</h3></body></html>
I am new to web scraping. Can someone guide me on how to extract reviews from Glassdoor?
Answer 0 (score: 0)
To get a proper response from the server, set the User-Agent HTTP header:
from requests import get
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0'}
url = "https://www.glassdoor.com/Reviews/The-Wonderful-Company-Reviews-E1005987_P2.htm?sort.sortType=RD&sort.ascending=false"
response = get(url, headers=headers)
html_soup = BeautifulSoup(response.text, 'html.parser')
print(html_soup)
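If it helps, the same idea can be wrapped in a small helper that also checks the HTTP status before parsing, so a 403 like the one in the question surfaces as an exception rather than as an error page passed to BeautifulSoup. This is only a sketch: `fetch_soup`, `HEADERS`, and the 10-second timeout are names and values made up for illustration, and a browser-like User-Agent may not be enough if the site changes its bot detection.

```python
import requests
from bs4 import BeautifulSoup

# Assumption: a browser-like User-Agent is sufficient; this string is the one
# used in the answer above.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) "
                  "Gecko/20100101 Firefox/79.0"
}


def fetch_soup(url, session=None, timeout=10):
    """Hypothetical helper: fetch `url` with HEADERS and return parsed HTML."""
    # Allow a caller-supplied session (useful for reuse or for testing).
    session = session or requests.Session()
    response = session.get(url, headers=HEADERS, timeout=timeout)
    # Raises requests.HTTPError on 4xx/5xx responses, e.g. "403 Bots not allowed".
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


if __name__ == "__main__":
    url = ("https://www.glassdoor.com/Reviews/The-Wonderful-Company-Reviews-"
           "E1005987_P2.htm?sort.sortType=RD&sort.ascending=false")
    try:
        soup = fetch_soup(url)
        print(soup.title)
    except requests.HTTPError as err:
        print("Request was rejected:", err)
```

Calling `raise_for_status()` is a design choice here: it makes a blocked request fail loudly at the fetch step, which is usually easier to debug than an empty result later in the parsing code.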