I'm trying to get the titles of the listings at this URL, but this code returns None.
import requests
from bs4 import BeautifulSoup
# get the data
data = requests.get('https://www.lamudi.com.ph/metro-manila/makati/condominium/buy/')
# Update Header
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0',
})
# load data into bs4
soup = BeautifulSoup(data.text, 'html.parser')
# We need to extract all the data in this div: <div class="ListingCell-KeyInfo-title" ..>
listingsTitle = soup.find('div', { 'class': 'ListingCell-KeyInfo-title'})
print(listingsTitle)
Does anyone know why?
Thanks
Answer 0 (score: 0)
The site you are requesting is treating you as a bot.
The response to the request:
<h1>Pardon Our Interruption...</h1>
<p>
As you were browsing <strong>www.lamudi.com.ph</strong> something about your
browser made us think you were a bot. There are a few reasons this might happen:
</p>
<ul>
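Before parsing anything from the response, print its content first to make sure you actually reached the page you expected.
You need to send a User-Agent, or something else that makes the request look like it comes from a real user. Note that in your code the headers are built after requests.get() has already run and are never passed to it, so the request goes out with the default python-requests User-Agent.
Try adding this to your request headers: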
USER_AGENT_FIREFOX= 'Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0'
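As a minimal sketch of what "adding it to your request headers" could look like: the headers are built first and then passed to requests.get() through its headers argument. The helper names build_headers and fetch_listings_page, and the 10-second timeout, are assumptions made for illustration and do not come from the question. There is also no guarantee that a User-Agent alone gets past this site's bot detection.

```python
import requests

URL = 'https://www.lamudi.com.ph/metro-manila/makati/condominium/buy/'
USER_AGENT_FIREFOX = 'Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0'


def build_headers():
    """Start from the requests defaults and override the User-Agent (hypothetical helper)."""
    headers = requests.utils.default_headers()
    headers.update({'User-Agent': USER_AGENT_FIREFOX})
    return headers


def fetch_listings_page(url=URL):
    """Send the GET request with the custom headers attached (hypothetical helper)."""
    # The headers only take effect if they are passed to the request itself.
    return requests.get(url, headers=build_headers(), timeout=10)


if __name__ == '__main__':
    response = fetch_listings_page()
    # Inspect the start of the body before handing it to BeautifulSoup.
    print(response.status_code)
    print(response.text[:500])
```

If the printed body still begins with the "Pardon Our Interruption..." page, the site is using more than the User-Agent to decide, and this change on its own will not be enough.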
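Answer 1 (score: 0)
I tried it with Selenium and explicit waits, but that did not work either. If you print the soup, you can see what goes wrong. The page actually returns the following: "As you were browsing www.lamudi.com.ph something about your browser made us think you were a bot. There are a few reasons this might happen: ..."
The site recognizes that you are not a human visitor.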
import requests
from bs4 import BeautifulSoup
# get the data
data = requests.get('https://www.lamudi.com.ph/metro-manila/makati/condominium/buy/')
# load data into bs4
soup = BeautifulSoup(data.text, 'html.parser')
# We need to extract all the data in this div: <div class="ListingCell-KeyInfo-title" ..>
print(soup)  # --> the output is the bot-detection page, not the listings
listingsTitle = soup.find('div', class_='ListingCell-KeyInfo-title')
print(listingsTitle)
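The "print it first" check can also be done in code. The sketch below looks for phrases from the interruption page quoted in this thread before any parsing is attempted. The function name is_bot_block_page and the choice of marker strings are assumptions for illustration; the site could change that wording at any time.

```python
# Phrases taken from the interruption page quoted in this thread (assumed stable).
BLOCK_MARKERS = (
    'Pardon Our Interruption',
    'made us think you were a bot',
)


def is_bot_block_page(html):
    """Return True if the HTML looks like the bot-detection page (hypothetical helper)."""
    return any(marker in html for marker in BLOCK_MARKERS)


if __name__ == '__main__':
    sample = '<h1>Pardon Our Interruption...</h1>'
    if is_bot_block_page(sample):
        print('Blocked: the response is the bot-detection page, nothing to parse.')
```

Calling is_bot_block_page(data.text) before soup.find() would make it obvious that the None result comes from the blocked response and not from a wrong class name.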