Question

I am trying to get the titles of the listings at this URL, but this code returns None.

import requests 
from bs4 import BeautifulSoup  

# get the data 
data = requests.get('https://www.lamudi.com.ph/metro-manila/makati/condominium/buy/')

# Update Header
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0',
})
# load data into bs4
soup = BeautifulSoup(data.text, 'html.parser')

# We need to extract all the data in this div: <div class="ListingCell-KeyInfo-title" ..>

listingsTitle = soup.find('div', { 'class': 'ListingCell-KeyInfo-title'})
print(listingsTitle)

Does anyone know why?

Thanks

Answer 1

The site you are requesting is treating you as a bot.

This is the response to the request:

<h1>Pardon Our Interruption...</h1>
<p>
      As you were browsing <strong>www.lamudi.com.ph</strong> something about your 
browser made us think you were a bot. There are a few reasons this might happen:
        </p>
<ul>

Before parsing anything in the response, print its content first to make sure you actually reached the page you wanted.
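One way to do that check before handing the text to BeautifulSoup is sketched below. The helper names looks_blocked and preview are made up for illustration, and the marker string is simply copied from the response above, so the check would stop working if the site changes that wording.

```python
# Marker copied from the bot-check page shown in the response above.
# Assumption: the site keeps using this exact heading text.
BLOCK_MARKER = "Pardon Our Interruption"


def looks_blocked(html: str) -> bool:
    """Return True if the HTML looks like the bot-check page."""
    return BLOCK_MARKER.lower() in html.lower()


def preview(html: str, limit: int = 300) -> str:
    """Return the first `limit` characters with whitespace collapsed."""
    return " ".join(html.split())[:limit]
```

With the code from the question, that would mean calling print(preview(data.text)) and testing looks_blocked(data.text) before building the soup.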

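You have to send a User-Agent, or something else, that makes the request look like it comes from a real user.

Try adding this to your request headers: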

USER_AGENT_FIREFOX= 'Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0'
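Note that the code in the question builds a headers dict but never passes it to requests.get, so the default requests User-Agent is still sent. A minimal sketch of wiring it in follows; build_headers and fetch are hypothetical helper names, the 10 second timeout is an arbitrary choice, and there is no guarantee the site accepts this User-Agent.

```python
import requests

USER_AGENT_FIREFOX = 'Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0'


def build_headers(user_agent=USER_AGENT_FIREFOX):
    """Start from the requests defaults and override only the User-Agent."""
    headers = requests.utils.default_headers()
    headers.update({'User-Agent': user_agent})
    return headers


def fetch(url, timeout=10):
    """Request `url` with the custom headers and return the body text."""
    # The headers must be passed here; creating them is not enough.
    response = requests.get(url, headers=build_headers(), timeout=timeout)
    response.raise_for_status()  # raises on 4xx/5xx status codes
    return response.text
```

Calling fetch with the URL from the question and printing the start of the result shows whether the bot-check page is still being served.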

Answer 2

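I tried it with Selenium and explicit waits, but that did not work either. If you print the soup, you can see the error. The page actually returns the following: "As you were browsing www.lamudi.com.ph something about your browser made us think you were a bot. There are a few reasons this might happen: ..."

The site recognizes that you are not a human.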

import requests 
from bs4 import BeautifulSoup  

# get the data 
data = requests.get('https://www.lamudi.com.ph/metro-manila/makati/condominium/buy/')

# load data into bs4
soup = BeautifulSoup(data.text, 'html.parser')

# We need to extract all the data in this div: <div class="ListingCell-KeyInfo-title" ..>
print(soup)    # --> this print shows the bot-check page instead of the listings

listingsTitle = soup.find('div', class_='ListingCell-KeyInfo-title')
print(listingsTitle)
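If the request does get through at some point, soup.find returns only the first match. A sketch that collects every title is below; extract_titles is a hypothetical helper, the class name is the one from the question, and the exact markup inside each div is an assumption.

```python
from bs4 import BeautifulSoup


def extract_titles(html):
    """Return the text of every div with class ListingCell-KeyInfo-title."""
    soup = BeautifulSoup(html, 'html.parser')
    cells = soup.find_all('div', class_='ListingCell-KeyInfo-title')
    # get_text(strip=True) drops surrounding whitespace and nested tags.
    return [cell.get_text(strip=True) for cell in cells]
```

On the bot-check page this returns an empty list, which is easier to test for than the None that soup.find gives back.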

BeautifulSoup returns None
