Question

I am trying to scrape an Instagram page and want to access the div tags that sit inside the span tag, but I can't. The HTML of the Instagram page looks like this:

 <head>--</head>
    <body>
       <span id="react-root" aria-hidden="false">
       <form enctype="multipart/form-data" method="POST" role="presentation">…</form>
       <section class="_9eogI E3X2T">
          <main class="SCxLW  o64aR" role="main">
             <div class="v9tJq VfzDr">
                 <header class=" HVbuG">…</header>
                 <div class="_4bSq7">…</div>
                 <div class="fx7hk">…</div>
             </div>
          </main>
      </section>
    </body>

This is what I tried:

from bs4 import BeautifulSoup
import urllib.request as urllib2
html_page = urllib2.urlopen("https://www.instagram.com/cherrified_/?hl=en")
soup = BeautifulSoup(html_page,"lxml")
span_tag = soup.find('span') # returns the span tag correctly
span_tag.find_all('div')    # returns an empty list. Why?

Please include an example as well.

Answer 1

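Instagram is a single-page application powered by React. That means its page source is little more than an "empty" shell that loads JavaScript, and the JavaScript generates the content dynamically in the browser once it has been downloaded.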
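To make the point concrete, here is a minimal sketch using only the Python standard library. The RAW_SOURCE string and the TagCollector class are made up for illustration; they are not Instagram's actual markup, just a stand-in for a page whose content is built by a script. It shows that a parser only sees the tags present in the downloaded text and never runs the script.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for a single-page app's initial HTML: an empty root
# element plus a script that would build the page in a browser.
RAW_SOURCE = """
<html>
  <body>
    <span id="react-root"></span>
    <script>
      var root = document.getElementById("react-root");
      root.appendChild(document.createElement("div"));
    </script>
  </body>
</html>
"""


class TagCollector(HTMLParser):
    """Records the name of every start tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(RAW_SOURCE)
collector.close()

# The script body is treated as plain text and never executed, so the <div>
# it would create in a browser does not appear among the parsed tags.
print(collector.tags)
```

Any HTML parser, BeautifulSoup included, works from the text it is given, so it behaves the same way here.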

Click "View page source", or go to view-source:https://www.instagram.com/cherrified_/?hl=en in Chrome. This is the HTML you are downloading with urllib.request.

You will see that there is only one <span> tag and that it contains no <div> tags. (Note: a <div> is not allowed inside a <span>.)
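The difference between the downloaded source and the markup shown in the browser's developer tools can be sketched as follows. This uses only the standard library so it runs without extra packages; with BeautifulSoup the equivalent check is the span_tag.find_all('div') call from the question. Both HTML strings and the names DivInSpanCounter and count_divs_in_spans are assumptions made for illustration, and the second string is a simplified version of the question's sample with the span closed.

```python
from html.parser import HTMLParser


class DivInSpanCounter(HTMLParser):
    """Counts <div> start tags that occur while a <span> is still open."""

    def __init__(self):
        super().__init__()
        self.open_spans = 0
        self.div_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.open_spans += 1
        elif tag == "div" and self.open_spans > 0:
            self.div_count += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.open_spans > 0:
            self.open_spans -= 1


def count_divs_in_spans(html):
    """Returns how many <div> tags appear inside any <span> in html."""
    counter = DivInSpanCounter()
    counter.feed(html)
    counter.close()
    return counter.div_count


# Hypothetical downloaded source: the span is empty.
downloaded = '<body><span id="react-root"></span></body>'

# Simplified rendered markup, as the browser shows it after JavaScript has run.
rendered = (
    '<body><span id="react-root"><section><main>'
    '<div class="v9tJq VfzDr">'
    '<div class="_4bSq7"></div><div class="fx7hk"></div>'
    '</div>'
    '</main></section></span></body>'
)

print(count_divs_in_spans(downloaded))  # 0
print(count_divs_in_spans(rendered))    # 3
```

An empty result for the downloaded text is therefore expected: the div tags exist only after the browser has executed the page's JavaScript.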

You cannot scrape instagram.com this way. It also might not be legal (I am not a lawyer).

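Notes:

Your HTML sample does not include the closing tag for the <span>.
Your HTML sample does not match the link you gave in the Python snippet.
In the last line of the Python snippet, you probably meant span_tag.find_all('div') (note the variable name and the singular 'div').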
