This is part of the HTML I am scraping from the platform. It contains the snippet I want to extract: the href attribute value of the a tag with the class "bookTitle".
</div>
<div class="elementList" style="padding-top: 10px;">
<div class="left" style="width: 75%;">
<a class="leftAlignedImage" href="/book/show/2784.Ways_of_Seeing" title="Ways of Seeing"><img alt="Ways of Seeing" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1464018308l/2784._SY75_.jpg"/></a>
<a class="bookTitle" href="/book/show/2784.Ways_of_Seeing">Ways of Seeing (Paperback)</a>
<br/>
<span class="by">by</span>
<span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
<div class="authorName__container">
<a class="authorName" href="https://www.goodreads.com/author/show/29919.John_Berger" itemprop="url"><span itemprop="name">John Berger</span></a>
</div>
After logging in with the mechanize library, I use the code below to try to extract it, but it returns the book's title, as the code asks for. I have tried several ways to get only the href value, but none of them has worked so far.
from bs4 import BeautifulSoup as bs4
from requests import Session
from lxml import html
import Downloader as dw
import mechanize as mc
import requests

def getGenders(browser: mc.Browser, url: str, name: str) -> None:
    res = browser.open(url)
    aux = res.read()
    html2 = bs4(aux, 'html.parser')
    with open(name, "w", encoding='utf-8') as file2:
        file2.write(str(html2))

# br is the mechanize.Browser instance created during login
getGenders(br, "https://www.goodreads.com/shelf/show/art", "gendersBooks.html")
with open("gendersBooks.html", "r", encoding='utf8') as file:
    contents = file.read()

bsObj = bs4(contents, "lxml")
aux = open("books.text", "w", encoding='utf8')
officials = bsObj.find_all('a', {'class': 'bookTitle'})
for text in officials:
    print(text.get_text())
    aux.write(text.get_text())
aux.close()
file.close()
Answer 0 (score: 1)
Could you try this? (Sorry if it doesn't work; I am not at a computer with Python right now.)
for text in officials:
print(text['href'])
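Applied to the HTML fragment from the question, that fix looks like this. A minimal self-contained sketch; note that the attribute filter is case-sensitive, so the class must be spelled 'bookTitle', not 'booktitle':

```python
from bs4 import BeautifulSoup

# the <a> tag copied from the question's HTML
html = '<a class="bookTitle" href="/book/show/2784.Ways_of_Seeing">Ways of Seeing (Paperback)</a>'
soup = BeautifulSoup(html, 'html.parser')

# ['href'] raises KeyError if the attribute is missing,
# .get('href') would return None instead
for tag in soup.find_all('a', {'class': 'bookTitle'}):
    print(tag['href'])  # -> /book/show/2784.Ways_of_Seeing
```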
Answer 1 (score: 0)
BeautifulSoup works fine with the HTML code you provided. If you want a tag's text, just use ".text"; if you want the href, use ".get('href')", or, if you are sure the tag has an href value, you can use "['href']".
Here is a simple example using your HTML snippet that should be easy to follow.
from bs4 import BeautifulSoup
html_code = '''
</div>
<div class="elementList" style="padding-top: 10px;">
<div class="left" style="width: 75%;">
<a class="leftAlignedImage" href="/book/show/2784.Ways_of_Seeing" title="Ways of Seeing"><img alt="Ways of Seeing" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1464018308l/2784._SY75_.jpg"/></a>
<a class="bookTitle" href="/book/show/2784.Ways_of_Seeing">Ways of Seeing (Paperback)</a>
<br/>
<span class="by">by</span>
<span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
<div class="authorName__container">
<a class="authorName" href="https://www.goodreads.com/author/show/29919.John_Berger" itemprop="url"><span itemprop="name">John Berger</span></a>
</div>
'''
soup = BeautifulSoup(html_code, 'html.parser')
tag = soup.find('a', {'class':'bookTitle'})
# - Book Title -
title = tag.text
print(title)
# - Href Link -
href = tag.get('href')
print(href)
I don't know why you download the HTML, save it to disk, and then open it again. That is completely unnecessary if you only want to grab a few tag values: you can keep the HTML in a variable and pass that variable straight to BeautifulSoup.
I also see that you import the requests library but use mechanize instead. As far as I know, requests is the simplest and most modern library for fetching data from a web page in Python. You also import "Session" from requests; a Session is not necessary unless you are making multiple requests and want to keep the connection to the server open for faster subsequent requests.
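For illustration, a minimal sketch of what a Session gives you; the User-Agent value here is just an example:

```python
import requests

# a Session pools connections and carries shared state (headers,
# cookies) across every request made through it
session = requests.Session()
session.headers.update({'User-Agent': 'my-scraper/1.0'})  # example header

# session.get(url) would now reuse the open connection and send
# the header above on every call
```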
Also, when you open a file with a "with" statement you are using a Python context manager, which handles closing the file for you. That means you do not have to close the file yourself at the end.
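In other words, a block like this needs no explicit close() call; the file name here is just an example:

```python
# the context manager closes the file when the block exits,
# even if an exception is raised inside it
with open('example.txt', 'w', encoding='utf8') as f:
    f.write('Ways of Seeing (Paperback)\n')

# outside the block the file object is already closed
print(f.closed)  # -> True
```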
So, without saving the downloaded HTML to disk, your code can be simplified to this:
from bs4 import BeautifulSoup
import requests

url = 'https://www.goodreads.com/shelf/show/art'
html_source = requests.get(url).content
soup = BeautifulSoup(html_source, 'html.parser')

# - To get the tag that we want -
tag = soup.find('a', {'class': 'bookTitle'})

# - Extract Book Title -
title = tag.text

# - Extract href from Tag -
href = tag.get('href')
Now, if the page has multiple "a" tags with the same class name ('a', {'class': 'bookTitle'}), you can do the following.
First, get all the "a" tags:
a_tags = soup.find_all('a', {'class': 'bookTitle'})
Then grab the info from every book tag and append each book's info to a books list:
books = []
for a in a_tags:
    try:
        title = a.text
        href = a.get('href')
        books.append({'title': title, 'href': href})  # <-- add each book dict to books list
        print(title)
        print(href)
    except:
        pass
To understand your code better, I suggest reading these related links:
BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Requests: https://requests.readthedocs.io/en/master/
Python context managers: https://book.pythontips.com/en/latest/context_managers.html