I am trying to use BeautifulSoup to extract information from an HTML file.
<a href="/s?_encoding=UTF8&field-author=Reza%20Aslan&search-alias=books&sort=relevancerank">Reza Aslan</a> <span class="byLinePipe">(Author)</span>
I used BeautifulSoup's findAll function to extract the author, Reza Aslan, from the markup above:
from bs4 import BeautifulSoup

# Read the saved product page from disk.
with open("book1.html", "r") as f:
    ecj_data = f.read()
soup = BeautifulSoup(ecj_data, "html.parser")
# Print the text of every <span class="byLinePipe"> on the page.
for definition in soup.findAll('span', {"class": 'byLinePipe'}):
    print(definition.get_text())
The command gave me: "Release date:"
This means there is another element with the class "byLinePipe":
<div class="buying"><span class="byLinePipe">Release date: </span><span style="font-weight: bold;">July 16, 2013</span> </div>
Does anyone know how to tell these two pieces of markup apart so that only the author name is printed?
Answer 0 (score: 0)
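It is better to find a unique tag near the author name than to go through a collection of elements that merely share a class. For example, we can find the book's title by its unique id and then use the find_next function to find the next link after it (which contains the author's name). See the code below.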
Code:
from bs4 import BeautifulSoup as bsoup
import requests as rq

url = "http://www.amazon.com/Zealot-Times-Jesus-Nazareth-ebook/dp/B00BRUQ7ZY"
r = rq.get(url)
soup = bsoup(r.content, "html.parser")
# The title span has a unique id, so it is a reliable starting point.
title = soup.find("span", id="btAsinTitle")
# The next link after the title is the author link.
author = title.find_next("a", href=True)
print(title.get_text())
print(author.get_text())
Result:
Zealot: The Life and Times of Jesus of Nazareth [Kindle Edition]
Reza Aslan
[Finished in 2.4s]
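For anyone who wants to try the idea without hitting the live page, here is a minimal offline sketch for Python 3. SAMPLE_HTML is assembled from the two fragments in the question plus a title span with the id btAsinTitle; its exact layout is an assumption, and the helper names author_after_title and author_before_label are hypothetical. The second helper is a variation that is not part of the code above: it starts from the byLinePipe span whose text is exactly "(Author)" and walks back to the preceding link with find_previous.

```python
from bs4 import BeautifulSoup

# Assumed layout: the question's two fragments plus a title span.
SAMPLE_HTML = """
<div class="buying">
  <h1><span id="btAsinTitle">Zealot: The Life and Times of Jesus of Nazareth</span></h1>
  <a href="/s?_encoding=UTF8&field-author=Reza%20Aslan&search-alias=books&sort=relevancerank">Reza Aslan</a>
  <span class="byLinePipe">(Author)</span>
</div>
<div class="buying"><span class="byLinePipe">Release date: </span><span style="font-weight: bold;">July 16, 2013</span> </div>
"""


def author_after_title(soup):
    """Find the title by its unique id, then take the next link after it."""
    title = soup.find("span", id="btAsinTitle")
    if title is None:
        return None
    link = title.find_next("a", href=True)
    return link.get_text(strip=True) if link else None


def author_before_label(soup):
    """Find the '(Author)' label span, then take the link just before it."""
    label = soup.find("span", class_="byLinePipe", string="(Author)")
    if label is None:
        return None
    link = label.find_previous("a", href=True)
    return link.get_text(strip=True) if link else None


if __name__ == "__main__":
    soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
    print(author_after_title(soup))
    print(author_before_label(soup))
```

Both helpers return None when their starting element is missing, which is worth checking for since the real page layout may differ from this sample.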
Hope this helps.