Question

I am trying to scrape a website that contains the following HTML:

 <div class="content-sidebar-wrap"><main class="content"><article 
 class="post-773 post type-post status-publish format-standard has-post-
 thumbnail category-money entry" itemscope 
 itemtype="http://schema.org/CreativeWork">

This contains the data I'm interested in. I tried to parse it with BeautifulSoup, but the following is returned instead:

 <div class="content-sidebar-wrap"><main class="content"><article 
 class="entry">
 <h1 class="entry-title">Not found, error 404</h1><div class="entry-content
 "><p>"The page you are looking for no longer exists. Perhaps you can return 
 back to the site's "<a href="http://www.totalsportek.com/">homepage</a> and 
 see if you can find what you are looking for. Or, you can try finding it
 by using the search form below.</p><form 
 action="http://www.totalsportek.com/" class="search-form" 
 itemprop="potentialAction" itemscope="" 
 itemtype="http://schema.org/SearchAction" method="get" role="search">

 # I've made small modifications to make it readable

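The BeautifulSoup element does not contain the markup I want. I'm not very familiar with HTML, but I assume the page calls some external service that returns the data..? I have read that this has something to do with Schema.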

Is there any way I can access this data?

Answer 1

You need to specify the User-Agent header when making the request. A working example that prints the article title and content:

import requests
from bs4 import BeautifulSoup

url = "http://www.totalsportek.com/money/barcelona-player-salaries/"

# Send a browser-like User-Agent; without it the site appears to serve its 404 page.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36"}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")

# Select the published post; the 404 page's <article> carries only the "entry" class.
article = soup.select_one(".content article.post.entry.status-publish")
header = article.header.get_text(strip=True)
content = article.select_one(".entry-content").get_text(strip=True)

print(header)
print(content)
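
If the request is rejected again (for example because the header is missing), `select_one` returns `None` and `article.header` raises an `AttributeError`. A minimal sketch of a guard follows, assuming the same selector as above. The function name `extract_article` and the two sample pages are hypothetical; they are made up here so the check can run without a network call.

```python
from bs4 import BeautifulSoup

# Same selector as in the example above.
ARTICLE_SELECTOR = ".content article.post.entry.status-publish"


def extract_article(html):
    """Return (title, body) for a published post, or None for any other page."""
    soup = BeautifulSoup(html, "html.parser")
    article = soup.select_one(ARTICLE_SELECTOR)
    if article is None:
        # No published post matched, e.g. the site served its 404 page.
        return None
    header = article.find("header")
    body = article.select_one(".entry-content")
    title = header.get_text(strip=True) if header else ""
    text = body.get_text(strip=True) if body else ""
    return title, text


# Hypothetical sample pages, loosely modelled on the markup in the question.
REAL_PAGE = (
    '<main class="content">'
    '<article class="post-773 post type-post status-publish entry">'
    '<header><h1 class="entry-title">Sample title</h1></header>'
    '<div class="entry-content"><p>Sample body.</p></div>'
    "</article></main>"
)
ERROR_PAGE = (
    '<main class="content"><article class="entry">'
    '<h1 class="entry-title">Not found, error 404</h1>'
    '<div class="entry-content"><p>The page no longer exists.</p></div>'
    "</article></main>"
)

print(extract_article(REAL_PAGE))
print(extract_article(ERROR_PAGE))
```

With a live response you would pass `response.content` to the function and treat a `None` result as a sign that the server did not return the real article.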

Scraping Schema with Beautiful Soup?
