from urllib.request import urlopen
from bs4 import BeautifulSoup
page_origin = urlopen("https://stackoverflow.com")
page_html = page_origin.read()
page_origin.close()
print(page_html)
The result is the complete HTML of https://stackoverflow.com. It works fine. I am not pasting it here because it is too long.
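For reference, a minimal sketch of decoding those bytes to a string before parsing, assuming the server declares its charset in the Content-Type header (Stack Overflow serves UTF-8); the get_content_charset call and the fallback are additions of mine, not part of the original post:
from urllib.request import urlopen
with urlopen("https://stackoverflow.com") as page_origin:
    # the response headers carry the declared charset, e.g. "utf-8"
    charset = page_origin.headers.get_content_charset() or "utf-8"
    page_html = page_origin.read().decode(charset)
print(page_html[:200])  # first 200 characters of the decoded page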
The problem is with BeautifulSoup. I added two lines of code to parse the HTML with BeautifulSoup, and something strange happened: it did not work at all.
from urllib.request import urlopen
from bs4 import BeautifulSoup
page_origin = urlopen("https://stackoverflow.com")
page_html = page_origin.read()
page_origin.close()
# print(page_html)
page_soup = BeautifulSoup(page_html, features="lxml", from_encoding="gbk")
print(page_soup)
The result is extremely short:
<!DOCTYPE html>
<html class="html__responsive">
<head>
<title>
Stack Overflow - Where Developers Learn, Share, & Build Careers
</title>
<link href="https://cdn.sstatic.net/Sites/stackoverflow/img/favicon.ico?v=4f32ecc8f43d" rel="shortcut icon"/>
<link href="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a" rel="apple-touch-icon image_src"/>
<link href="/opensearch.xml" rel="search" title="Stack Overflow" type="application/opensearchdescription+xml"/>
</head>
</html>
This is not the complete HTML, so I cannot analyze it at all.
Please help me; I have spent too much time debugging this. Thanks.
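One possible cause, stated here only as an assumption since the original post does not confirm it: stackoverflow.com is served as UTF-8, so forcing from_encoding="gbk" can make lxml mis-decode the bytes and drop most of the document. A minimal sketch that simply lets BeautifulSoup detect the encoding itself:
from urllib.request import urlopen
from bs4 import BeautifulSoup
with urlopen("https://stackoverflow.com") as page_origin:
    page_html = page_origin.read()
# no from_encoding argument: BeautifulSoup sniffs the encoding from the bytes
page_soup = BeautifulSoup(page_html, features="lxml")
print(len(page_soup.prettify()))  # should be far longer than the truncated output above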
Answer 0 (score: 0)
This gives me the complete source code:
import requests
from bs4 import BeautifulSoup
# requests decodes the response body to text using the charset declared by the server
r = requests.get('https://stackoverflow.com/')
soup = BeautifulSoup(r.text, 'lxml')
print(soup)
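As a quick sanity check that the whole page really parsed, one can pull something out of the tree; the lines below use only generic tags (the title and anchor elements), so they do not depend on Stack Overflow's specific markup:
# print the page title and count the links to confirm the full document is present
print(soup.title.get_text(strip=True))
print(len(soup.find_all('a')), "links found")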