Question

I am trying to use BeautifulSoup to extract information from an HTML file.

<a href="/s?_encoding=UTF8&amp;field-author=Reza%20Aslan&amp;search-alias=books&amp;sort=relevancerank">Reza Aslan</a> <span class="byLinePipe">(Author)</span>

I used BeautifulSoup's findAll function to extract the author, Reza Aslan, from the markup above:

import urllib2
from bs4 import BeautifulSoup
import re


ecj_data = open("book1.html",'r').read()

soup = BeautifulSoup(ecj_data)

for definition in soup.findAll('span', {"class":'byLinePipe'}):
    definition = definition.renderContents()

print definition

The command gave me: "Release date: "

This means there is another element with the class "byLinePipe":

<div class="buying"><span class="byLinePipe">Release date: </span><span style="font-weight: bold;">July 16, 2013</span> </div>

Does anyone know how to tell these two pieces of markup apart so that the author name is printed?

Answer 1

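It is better to find a unique marker near the author's name than to go through a collection of elements that share a class. For example, we can locate the book's title by its unique id, then use the find_next function to find the next link after it (which contains the author's name). See the code below.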

Code:

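from bs4 import BeautifulSoup as bsoup
import requests as rq

url = "http://www.amazon.com/Zealot-Times-Jesus-Nazareth-ebook/dp/B00BRUQ7ZY"
r = rq.get(url)
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = bsoup(r.content, "html.parser")

# The title span has a unique id, so it is a reliable anchor.
title = soup.find("span", id="btAsinTitle")
# The next link after the title holds the author's name.
author = title.find_next("a", href=True)

print(title.get_text())
print(author.get_text())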

Result:

Zealot: The Life and Times of Jesus of Nazareth [Kindle Edition]
Reza Aslan
[Finished in 2.4s]
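
If you are working from a saved file rather than the live page, the same idea can be tried directly on the two fragments quoted in the question. The following is only a sketch: the helper names `authors_by_label` and `authors_by_href` are made up for illustration, and it assumes the author link and its "(Author)" label are siblings, as they are in the quoted markup. The real page may be structured differently.

```python
from bs4 import BeautifulSoup

# The two fragments quoted in the question, combined into one document.
html = """
<a href="/s?_encoding=UTF8&amp;field-author=Reza%20Aslan&amp;search-alias=books&amp;sort=relevancerank">Reza Aslan</a> <span class="byLinePipe">(Author)</span>
<div class="buying"><span class="byLinePipe">Release date: </span><span style="font-weight: bold;">July 16, 2013</span> </div>
"""

soup = BeautifulSoup(html, "html.parser")


def authors_by_label(soup):
    """Hypothetical helper: text of each link that precedes an "(Author)" label."""
    names = []
    for span in soup.find_all("span", class_="byLinePipe"):
        if span.get_text(strip=True) != "(Author)":
            continue  # skip "Release date:" and any other byLinePipe spans
        link = span.find_previous_sibling("a")  # assumes the link is a sibling
        if link is not None:
            names.append(link.get_text(strip=True))
    return names


def authors_by_href(soup):
    """Hypothetical helper: text of each link whose href contains field-author."""
    return [a.get_text(strip=True) for a in soup.select('a[href*="field-author"]')]


print(authors_by_label(soup))
print(authors_by_href(soup))
```

Under Python 3, both calls print `['Reza Aslan']` for this snippet. The first filters the shared class by its text, the second ignores the class entirely and relies on the `field-author` parameter in the link's href.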

Hope this helps.

Extracting the author name from an Amazon page using BeautifulSoup
