Question

I've noticed that whether I request the URL some other way or call

urllib.request.urlopen([my_url]).read()

I get something like this:

<html> <head> </head> <body> <span>...</span> <body> <script> </script> </html>

All the information I want to parse with BeautifulSoup is in the <span>...</span> part. If I use webdriver, that part is included, but webdriver seems to take much longer and makes my code messier. Is there a way to retrieve the entire HTML document without using webdriver?

Answer 1

Here is a simpler, more readable solution for parsing the contents of <span> tags:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

my_url = 'https://www.foo.com'

# Download the raw HTML; the context manager closes the connection afterwards
with uReq(my_url) as uClient:
    page_html = uClient.read()

page_soup = soup(page_html, "html.parser")
# find_all returns a list of matching tags, so read .text from each tag
span_content = page_soup.find_all("span", {"<attribute_name>": "<attribute_value>"})
for span in span_content:
    print(span.text)

Answer 2

You can use the well-known requests library. See whether the following code helps:

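import requests
from bs4 import BeautifulSoup

# Fetch the page; page.text holds the decoded response body
page = requests.get('https://www.google.com/')
# The 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(page.text, 'lxml')

# find_all returns a list of every <span> tag in the fetched markup
span = soup.find_all('span')
print(span)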
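Both answers parse whatever markup the HTTP response contains, and BeautifulSoup does not run scripts. As a rough offline sketch of what that means, the snippet below parses a skeleton like the one in the question. The `raw_html` string, the `price` class, and the `span_texts` helper are hypothetical stand-ins, and the example assumes the span is filled in by a script after the page loads, which the question suggests but does not confirm.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw response body described in the question
raw_html = """
<html>
  <head></head>
  <body>
    <span class="price"></span>
    <script>/* would fill the span in a real browser */</script>
  </body>
</html>
"""


def span_texts(html):
    """Return the stripped text of every <span> in the given markup."""
    soup = BeautifulSoup(html, "html.parser")
    return [span.get_text(strip=True) for span in soup.find_all("span")]


texts = span_texts(raw_html)
print(texts)
```

If the list holds only empty strings for the real page as well, the data is probably not in the static HTML, and switching between urlopen and requests would not be expected to change that.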

Retrieving the entire HTML using urlopen(url)
