Question

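I need to download the full HTML of a URL, but the response does not contain any of the HTML tags I expect. Instead, all I get is these lines: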

<!doctype html>
<html lang="en">
<head><meta charset="utf-8">  
<meta name="viewport" content="width=device-width, initial-scale=1">    <title></title>
 <link href="/github-user-search/app.bundle.562f293b75a96de878ab.css" rel="stylesheet"></head><body>
 <div id="root"></div>
 <script type="text/javascript" src="/github-user-search/app.bundle.562f293b75a96de878ab.js"></script></body>
 </html>

import requests
from bs4 import BeautifulSoup

url = 'https://simonsmith.github.io/github-user-search/#/search?per_page=42&page=1&q=Ben%20Newman'
# requests only downloads the static HTML; it does not execute JavaScript
response = requests.get(url)
print(response.content)
soup = BeautifulSoup(response.text, 'html.parser')
# Prints [] because the static page contains no <a> tags
print(soup.find_all('a'))

Answer 1

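When you request the URL above, the page content is loaded asynchronously by JavaScript. A plain HTTP request only returns the initial HTML shell, so the asynchronously loaded content is not there to scrape. You need to wait for the content to load before parsing it. I suggest using a headless browser such as PhantomJS or Puppeteer to wait for the dynamic content and then scrape it. Use the jQuery selector below to detect when the content is ready: wait until the user details have been loaded into the page, then continue with the data extraction.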

$('*[class^="User_"]')
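Since the question uses Python, the same idea can be sketched with Selenium instead of PhantomJS or Puppeteer. This is only a minimal sketch under some assumptions: Selenium 4 and a local Chrome installation are available, the function names `fetch_rendered_html` and `extract_links` are made up for illustration, and the `[class^="User_"]` selector is taken from the jQuery expression above. The link extraction uses the standard-library `HTMLParser` to keep the example light on dependencies; the returned HTML string could equally be passed to BeautifulSoup as in the question.

```python
from html.parser import HTMLParser


def fetch_rendered_html(url, selector='[class^="User_"]', timeout=15):
    """Load url in headless Chrome and return the DOM once selector matches."""
    # Imported lazily so extract_links() works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until JavaScript has added at least one matching element.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
        return driver.page_source
    finally:
        driver.quit()


class _LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(html):
    """Return the href values of all <a> tags in an HTML string."""
    parser = _LinkCollector()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    url = ("https://simonsmith.github.io/github-user-search/"
           "#/search?per_page=42&page=1&q=Ben%20Newman")
    print(extract_links(fetch_rendered_html(url)))
```

Run on the static shell from the question, `extract_links` returns an empty list, which matches what the asker observed. Links only appear once the HTML comes from the rendered page.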

The HTML contains JavaScript. How do I extract the HTML tags from it?
