I have written code to extract the URL and title of a book from a page using BeautifulSoup.
But it is not extracting the name of the book, Astounding Stories of Super-Science April 1930, which sits between the > and </a>
tags.
How can I extract the name of the book?
I have tried the findnext
method recommended in another question, but I get an AttributeError
when I call it.
HTML:
<li>
<a class="extiw" href="//www.gutenberg.org/ebooks/29390" title="ebook:29390">Astounding Stories of Super-Science April 1930</a>
<a class="image" href="/wiki/File:BookIcon.png"><img alt="BookIcon.png" height="16" src="//www.gutenberg.org/w/images/9/92/BookIcon.png" width="16"/></a>
(English)
</li>
Code below:
def make_soup(BASE_URL):
    r = requests.get(BASE_URL, verify=False)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def extract_text_urls(html):
    soup = make_soup(BASE_URL)
    for li in soup.findAll('li'):
        try:
            try:
                print li.a['href'], li.a['title']
                print "\n"
            except KeyError:
                pass
        except TypeError:
            pass

extract_text_urls(filename)
Answer 0 (score: 3)
You should use the text
attribute of the element. The following works for me:
def make_soup(BASE_URL):
    r = requests.get(BASE_URL)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def extract_text_urls(html):
    soup = make_soup(BASE_URL)
    for li in soup.findAll('li'):
        try:
            try:
                print li.a['href'], li.a.text
                print "\n"
            except KeyError:
                pass
        except TypeError:
            pass

extract_text_urls('http://www.gutenberg.org/wiki/Science_Fiction_(Bookshelf)')
I get the following output for the element in question:
//www.gutenberg.org/ebooks/29390 Astounding Stories of Super-Science April 1930
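The .text lookup can also be checked offline by parsing the <li> snippet from the question directly from a string, with no network request. A minimal Python 3 sketch (variable names are illustrative):

```python
from bs4 import BeautifulSoup

# The <li> snippet from the question, parsed from a string.
html = '''<li>
<a class="extiw" href="//www.gutenberg.org/ebooks/29390" title="ebook:29390">Astounding Stories of Super-Science April 1930</a>
<a class="image" href="/wiki/File:BookIcon.png"><img alt="BookIcon.png" src="//www.gutenberg.org/w/images/9/92/BookIcon.png"/></a>
(English)
</li>'''

soup = BeautifulSoup(html, 'html.parser')
li = soup.find('li')

# li.a is the first <a> inside the <li>; .text returns its inner text,
# while li.a['title'] would return the title attribute ("ebook:29390").
print(li.a['href'], li.a.text)
```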
Answer 1 (score: 1)
I don't see where you extract the text inside the tag. I would do something like this:
from bs4 import BeautifulSoup as bs
from urllib2 import urlopen as uo

soup = bs(uo(html))
for li in soup.findAll('li'):
    a = li.find('a')
    book_title = a.contents[0]
    print book_title
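For a tag whose only child is text, a.contents[0] and a.text give the same string; they differ when the tag has element children. A small Python 3 sketch of the equivalence for the question's tag (the snippet is hypothetical, trimmed from the question's HTML):

```python
from bs4 import BeautifulSoup

html = '<a title="ebook:29390">Astounding Stories of Super-Science April 1930</a>'
a = BeautifulSoup(html, 'html.parser').a

# .contents[0] is the first child node (a NavigableString here);
# .text joins all text descendants into a single str.
print(a.contents[0])
print(a.text)
```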
Answer 2 (score: 1)
To get text that is not inside any tag, use the get_text()
method. It is covered in the documentation here.
I can't test it because I don't know the URL of the page you are scraping, but you can probably just use the li
tag, since there doesn't seem to be any other text.
Try replacing this:
for li in soup.findAll('li'):
    try:
        try:
            print li.a['href'], li.a['title']
            print "\n"
        except KeyError:
            pass
    except TypeError:
        pass
with this:
for li in soup.findAll('li'):
    try:
        print(li.get_text())
        print("\n")
    except TypeError:
        pass
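Note that get_text() gathers every text node under the element, including text that sits outside any child tag, such as the "(English)" in the question's snippet. A minimal Python 3 sketch, again parsing a trimmed version of the question's HTML from a string:

```python
from bs4 import BeautifulSoup

html = '''<li>
<a class="extiw" href="//www.gutenberg.org/ebooks/29390" title="ebook:29390">Astounding Stories of Super-Science April 1930</a>
(English)
</li>'''

li = BeautifulSoup(html, 'html.parser').li

# get_text() concatenates all text descendants of <li>,
# so the output contains both the book title and "(English)".
print(li.get_text())
```

So if you only want the book title, li.a.text is the narrower choice; get_text() on the <li> brings the surrounding loose text along with it.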