Extracting the text inside an <a> tag

Time: 2015-09-19 18:50:07

Tags: python beautifulsoup

I have written code to extract the url and title of a book using BeautifulSoup from a page.

But it does not extract the name of the book, Astounding Stories of Super-Science April 1930, which sits between the opening <a ...> tag and the closing </a> tag.

How can I extract the name of the book?

I have tried the findnext method recommended in another question, but I get an AttributeError on that.

HTML:

    <li>
        <a class="extiw" href="//www.gutenberg.org/ebooks/29390" title="ebook:29390">Astounding Stories of Super-Science April 1930</a>
        <a class="image" href="/wiki/File:BookIcon.png"><img alt="BookIcon.png" height="16" src="//www.gutenberg.org/w/images/9/92/BookIcon.png" width="16"/></a>
        (English)
    </li>

Code below:

def make_soup(BASE_URL):
    r = requests.get(BASE_URL, verify = False)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def extract_text_urls(html):
    soup = make_soup(BASE_URL)

    for li in soup.findAll('li'):
        try:
            try:
                print li.a['href'], li.a['title']
                print "\n"
            except KeyError:
                pass
        except TypeError:
            pass

extract_text_urls(filename)

3 answers:

Answer 0 (score: 3)

You should use the element's text attribute. The following works for me:

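import requests
from bs4 import BeautifulSoup

def make_soup(url):
    # Download the page and parse it with the built-in HTML parser
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def extract_text_urls(url):
    # Use the URL passed in rather than an undefined global
    soup = make_soup(url)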

    for li in soup.findAll('li'):
        try:
            try:
                print li.a['href'], li.a.text
                print "\n"
            except KeyError:
                pass
        except TypeError:
            pass

extract_text_urls('http://www.gutenberg.org/wiki/Science_Fiction_(Bookshelf)')

I get the following output for the element in question:

//www.gutenberg.org/ebooks/29390 Astounding Stories of Super-Science April 1930
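
To make the difference between the `title` attribute and the link text concrete, here is a minimal sketch that parses the HTML fragment from the question directly, without any network request. It is written for Python 3 (the snippets in this thread use Python 2 `print` statements), and the variable names `title_attr` and `link_text` are only illustrative.

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the question
HTML = """
<li>
    <a class="extiw" href="//www.gutenberg.org/ebooks/29390" title="ebook:29390">Astounding Stories of Super-Science April 1930</a>
    <a class="image" href="/wiki/File:BookIcon.png"><img alt="BookIcon.png" height="16" src="//www.gutenberg.org/w/images/9/92/BookIcon.png" width="16"/></a>
    (English)
</li>
"""

soup = BeautifulSoup(HTML, "html.parser")
link = soup.li.a  # first <a> inside the first <li>

title_attr = link["title"]  # value of the title="..." attribute
link_text = link.text       # text between <a ...> and </a>

print(title_attr)  # ebook:29390
print(link_text)   # Astounding Stories of Super-Science April 1930
```

So `li.a['title']` in the question was working as designed: it returned the attribute value, which on this page is the ebook id rather than the book name.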

Answer 1 (score: 1)

I don't see where you extract the text inside the tag. I would do something like this:

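from bs4 import BeautifulSoup as bs
from urllib2 import urlopen as uo

# `html` is the URL of the page to scrape
soup = bs(uo(html), 'html.parser')

for li in soup.find_all('li'):
    a = li.find('a')
    # Skip list items that have no link or an empty link
    if a is None or not a.contents:
        continue
    book_title = a.contents[0]
    print book_title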
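
One caveat with `contents[0]`: it returns the first child node, which is not always a string. The sketch below (Python 3) uses a simplified, hypothetical fragment modelled on the question's markup, where the second link wraps only an image. The names `first_children`, `is_tag`, and `texts` are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, simplified fragment: one text link and one image-only link
HTML = '<li><a href="/ebooks/1">Some Title</a> <a href="/icon"><img alt="icon.png"/></a></li>'

soup = BeautifulSoup(HTML, "html.parser")
links = soup.find_all("a")

# First child of each link: a string for the first, an <img> tag for the second
first_children = [a.contents[0] for a in links]
is_tag = [isinstance(child, Tag) for child in first_children]

# get_text() always returns a string, empty when the link holds no text
texts = [a.get_text() for a in links]

print(is_tag)  # [False, True]
print(texts)   # ['Some Title', '']
```

When only the first link of each list item is inspected, as in the loop above, this rarely matters, but `get_text()` is the safer choice if a link might contain nested tags.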

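Answer 2 (score: 1)

To get only the text and none of the surrounding markup, use the get_text() method. It is described in the Beautiful Soup documentation.

I can't test it because I don't know the URL of the page you are scraping, but you can probably just use the li tag, since there doesn't appear to be any other text.

Try replacing this: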

    for li in soup.findAll('li'):
        try:
            try:
                print li.a['href'], li.a['title']
                print "\n"
            except KeyError:
                pass
        except TypeError:
            pass

with this:

    for li in soup.findAll('li'):
        try:
            print(li.get_text())
            print("\n")
        except TypeError:
            pass
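
Note that calling `get_text()` on the whole `li` also picks up text outside the link, such as the "(English)" label in the question's markup. A minimal Python 3 sketch, assuming a trimmed copy of that fragment; `whole_item` and `only_link` are illustrative names:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the fragment from the question
HTML = """
<li>
    <a class="extiw" href="//www.gutenberg.org/ebooks/29390" title="ebook:29390">Astounding Stories of Super-Science April 1930</a>
    <a class="image" href="/wiki/File:BookIcon.png"><img alt="BookIcon.png"/></a>
    (English)
</li>
"""

li = BeautifulSoup(HTML, "html.parser").li

# All text in the list item, pieces stripped and joined with a single space
whole_item = li.get_text(" ", strip=True)

# Only the text of the first link
only_link = li.a.get_text(strip=True)

print(whole_item)  # Astounding Stories of Super-Science April 1930 (English)
print(only_link)   # Astounding Stories of Super-Science April 1930
```

If only the book name is wanted, calling `get_text()` on `li.a` rather than on `li` avoids the trailing language label.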