Unable to capture HTML elements when web scraping with BeautifulSoup

Time: 2021-01-22 19:53:55

Tags: python python-3.x web-scraping beautifulsoup

I am running the following code in PyCharm:

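from bs4 import BeautifulSoup
import requests

# Fetch the profile page; no custom headers are sent with this request
source = requests.get("https://twitter.com/SGanguly").text

# Parse the response with the lxml parser and print the formatted markup
soup = BeautifulSoup(source, 'lxml')
print(soup.prettify())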

The output does not contain the HTML elements I want to capture. Instead of the expected classes and elements, it only shows an error page:

<!DOCTYPE html>
<html dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width,initial-scale=1,maximum-scale=1,user-scalable=0,viewport-fit=cover" name="viewport"/>
  <style>
   body {
      -ms-overflow-style: scrollbar;
      overflow-y: scroll;
      overscroll-behavior-y: none;
    }

    .errorContainer {
      background-color: #FFF;
      color: #0F1419;
      max-width: 600px;
      margin: 0 auto;
      padding: 10%;
      font-family: Helvetica, sans-serif;
      font-size: 16px;
    }

    .errorButton {
      margin: 3em 0;
    }

    .errorButton a {
      background: #1DA1F2;
      border-radius: 2.5em;
      color: white;
      padding: 1em 2em;
      text-decoration: none;
    }

    .errorButton a:hover,
    .errorButton a:focus {
      background: rgb(26, 145, 218);
    }

    .errorFooter {
      color: #657786;
      font-size: 80%;
      line-height: 1.5;
      padding: 1em 0;
    }

    .errorFooter a,
    .errorFooter a:visited {
      color: #657786;
      text-decoration: none;
      padding-right: 1em;
    }

    .errorFooter a:hover,
    .errorFooter a:active {
      text-decoration: underline;
    }
  </style>
 </head>
 <body>
  <div class="errorContainer">
   <img alt="Twitter" height="38" src="https://abs.twimg.com/errors/logo46x38.png" srcset="https://abs.twimg.com/errors/logo46x38.png 1x, https://abs.twimg.com/errors/logo46x38@2x.png 2x" width="46"/>
   <h1>
    This browser is no longer supported.
   </h1>
   <p>
    Please switch to a supported browser to continue using twitter.com. You can see a list of supported browsers in our Help Center.
   </p>
   <p class="errorButton">
    <a href="https://help.twitter.com/using-twitter/twitter-supported-browsers">
     Help Center
    </a>
   </p>
   <p class="errorFooter">
    <a href="https://twitter.com/tos">
     Terms of Service
    </a>
    <a href="https://twitter.com/privacy">
     Privacy Policy
    </a>
    <a href="https://support.twitter.com/articles/20170514">
     Cookie Policy
    </a>
    <a href="https://legal.twitter.com/imprint">
     Imprint
    </a>
    <a href="https://business.twitter.com/en/help/troubleshooting/how-twitter-ads-work.html">
     Ads info
    </a>
    © 2021 Twitter, Inc.
   </p>
  </div>
 </body>
</html>

1 Answer:

Answer 0 (score: 0)

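The direct answer to your question is that you need to send headers that define at least a User-Agent. However, Twitter relies on JavaScript, so you still won't be able to scrape the data using requests and BeautifulSoup alone. You also need something that can execute the JavaScript on the page. I personally use Selenium, but there are other options.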
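
As a rough illustration of both points, here is a minimal sketch, not a tested recipe for Twitter specifically. The User-Agent string, the helper names (build_headers, fetch_static_html, fetch_rendered_html) and the wait_css parameter are all assumptions made for this example. The Selenium part also assumes Chrome and a matching driver are available on the machine.

```python
import requests

# Assumption: any mainstream browser User-Agent string; this one is illustrative.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_headers(user_agent=DEFAULT_USER_AGENT):
    """Return request headers that define at least a User-Agent."""
    return {"User-Agent": user_agent}


def fetch_static_html(url, timeout=10):
    """Fetch the raw HTML with requests, sending a browser-like User-Agent.

    This does not run any JavaScript, so script-rendered content is missing.
    """
    response = requests.get(url, headers=build_headers(), timeout=timeout)
    response.raise_for_status()
    return response.text


def fetch_rendered_html(url, wait_css, timeout=15):
    """Load the page in headless Chrome and return the HTML after scripts run.

    wait_css is a hypothetical parameter: a CSS selector the caller expects
    to appear once the page has finished rendering.
    """
    # Imported here so the requests-only path works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the expected element exists, or raise after `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

The string returned by either function can be passed to BeautifulSoup(html, 'lxml') exactly as in the question. A call might look like fetch_rendered_html("https://twitter.com/SGanguly", wait_css="article"), where "article" is only a guess at an element that appears once the tweets have loaded.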