Question

How can I retrieve the page title of a webpage (the title HTML tag) using Python?

Answer 1

import urllib2
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen("https://www.google.com"))
print soup.title.string

Notes:

soup.title finds the first title element anywhere in the HTML document.
title.string assumes the element has exactly one child node, and that this child is a string.

For BeautifulSoup 4.x, use a different import:

from bs4 import BeautifulSoup

Answer 2

I always use lxml for tasks like this. You could use BeautifulSoup as well.

import lxml.html
t = lxml.html.parse(url)
print t.find(".//title").text

Answer 3

mechanize Browser objects have a title() method, so the code from this post can be rewritten as:

from mechanize import Browser
br = Browser()
br.open("http://www.google.com/")
print br.title()

Answer 4

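This is probably overkill for such a simple task, but if you plan to do more than that, it is saner to start with these tools (mechanize, BeautifulSoup), because they are much easier to use than the alternatives (urllib to fetch the content, and regexen or some other parser to parse the HTML).

Links: BeautifulSoup, mechanize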

#!/usr/bin/env python
#coding:utf-8

from BeautifulSoup import BeautifulSoup
from mechanize import Browser

#This retrieves the webpage content
br = Browser()
res = br.open("https://www.google.com/")
data = res.get_data() 

#This parses the content
soup = BeautifulSoup(data)
title = soup.find('title')

#This outputs the content :)
print title.renderContents()

Answer 5

Using HTMLParser:

from urllib.request import urlopen
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.match = False
        self.title = ''

    def handle_starttag(self, tag, attributes):
        self.match = tag == 'title'

    def handle_data(self, data):
        if self.match:
            self.title = data
            self.match = False

url = "http://example.com/"
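# Decode the response bytes; str(bytes) would yield the "b'...'" repr instead of the page text
html_string = urlopen(url).read().decode('utf-8', errors='replace')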

parser = TitleParser()
parser.feed(html_string)
print(parser.title)  # prints: Example Domain
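
One caveat with this approach: HTMLParser may deliver the text of a single element through several handle_data calls (for example when the document is fed in chunks), so keeping only one chunk can truncate the title. The following is a minimal sketch that accumulates the chunks instead. The class name AccumulatingTitleParser and its attributes are illustrative, not part of the original answer.

```python
from html.parser import HTMLParser


class AccumulatingTitleParser(HTMLParser):
    """Collect every text chunk seen inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False   # True while we are between <title> and </title>
        self.done = False       # True once the first title has been closed
        self.chunks = []        # text pieces delivered by handle_data

    def handle_starttag(self, tag, attrs):
        if tag == 'title' and not self.done:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'title' and self.in_title:
            self.in_title = False
            self.done = True

    @property
    def title(self):
        return ''.join(self.chunks).strip()


# Feed the document in two pieces to show that the title is reassembled.
parser = AccumulatingTitleParser()
parser.feed('<html><head><title>Exam')
parser.feed('ple Domain</title></head></html>')
print(parser.title)
```

Only the first title element is recorded, which mirrors what soup.title does in the BeautifulSoup answers.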

Answer 6

No other library is needed beyond requests: fetch the page and slice the title out of the response text.

>>> import requests
>>> headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0'}
>>> n = requests.get('http://www.imdb.com/title/tt0108778/', headers=headers)
>>> al = n.text
>>> al[al.find('<title>') + 7 : al.find('</title>')]
u'Friends (TV Series 1994\u20132004) - IMDb'
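
Note that the slice above relies on a bare `<title>` tag being present: str.find returns -1 when the tag is missing, which silently produces a wrong slice. A guarded version might look like the sketch below. The helper name slice_title is hypothetical and not part of requests; it takes the response text as a plain string.

```python
def slice_title(html_text):
    """Return the text between <title> and </title>, or None if either tag is missing."""
    open_tag, close_tag = '<title>', '</title>'

    start = html_text.find(open_tag)
    if start == -1:
        return None
    start += len(open_tag)

    # Search for the closing tag only after the opening tag.
    end = html_text.find(close_tag, start)
    if end == -1:
        return None

    return html_text[start:end].strip()


print(slice_title('<html><head><title>Friends - IMDb</title></head></html>'))
print(slice_title('<html><head></head></html>'))
```

With requests, this would be called as slice_title(n.text). It still does not handle a title tag that carries attributes or uses uppercase letters.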

Answer 7

Using regular expressions:

import re
match = re.search('<title>(.*?)</title>', raw_html)
title = match.group(1) if match else 'No title'
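
The pattern above only matches a lowercase `<title>` tag without attributes whose content sits on a single line. A somewhat more tolerant sketch follows, assuming Python 3 and that raw_html is already a decoded string. The names TITLE_RE and regex_title are illustrative. A regex remains a heuristic here, not a real HTML parser.

```python
import html
import re

# Allow attributes on the tag, any letter case, and titles spanning several lines.
TITLE_RE = re.compile(r'<title\b[^>]*>(.*?)</title\s*>', re.IGNORECASE | re.DOTALL)


def regex_title(raw_html, default='No title'):
    """Return the first title found in raw_html, or default when there is none."""
    match = TITLE_RE.search(raw_html)
    if match is None:
        return default
    # Decode entities such as &amp; and collapse runs of whitespace.
    text = html.unescape(match.group(1))
    return ' '.join(text.split())


print(regex_title('<TITLE lang="en">\n  Example &amp; Co\n</TITLE>'))
print(regex_title('<html><body>nothing here</body></html>'))
```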

Answer 8

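soup.title.string actually returns a unicode string. To convert it to a plain byte string, use string = string.encode('ascii', 'ignore'). Note that this silently drops any non-ASCII characters.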
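
To illustrate what that call does, here is a small sketch assuming Python 3, where every str is already Unicode and encode() returns a bytes object. The sample title is the one shown in Answer 6; the variable names are illustrative.

```python
# Title containing a non-ASCII character (an en dash, U+2013).
title = 'Friends (TV Series 1994\u20132004) - IMDb'

# 'ignore' drops every character that ASCII cannot represent.
ascii_bytes = title.encode('ascii', 'ignore')

# Decode again if a str rather than bytes is needed.
ascii_text = ascii_bytes.decode('ascii')

print(ascii_bytes)
print(ascii_text)
```

The en dash disappears entirely, so "1994–2004" becomes "19942004". If that loss matters, keeping the string as Unicode is usually the safer choice.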

Answer 9

Use soup.select_one to target the title tag:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('url')
soup = bs(r.content, 'lxml')
print(soup.select_one('title').text)

Answer 10

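Here is a fault-tolerant HTMLParser implementation. You can throw pretty much anything at get_title() without breaking it; if anything unexpected happens, get_title() returns None.

When Parser() downloads the page, it converts it to ASCII regardless of the charset used in the page, ignoring any errors. It would be trivial to change to_ascii() to convert the data to UTF-8 or any other encoding: just add an encoding argument and rename the function to something like to_encoding().

By default, HTMLParser() breaks on broken HTML; it even breaks on trivial things like mismatched tags. To prevent this, I replaced HTMLParser()'s error method with a function that ignores the errors.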

#-*-coding:utf8;-*-
#qpy:3
#qpy:console

''' 
Extract the title from a web page using
the standard lib.
'''

from html.parser import HTMLParser
from urllib.request import urlopen
import urllib.error

def error_callback(*_, **__):
    pass

def is_string(data):
    return isinstance(data, str)

def is_bytes(data):
    return isinstance(data, bytes)

def to_ascii(data):
    if is_string(data):
        data = data.encode('ascii', errors='ignore')
    elif is_bytes(data):
        data = data.decode('ascii', errors='ignore')
    else:
        data = str(data).encode('ascii', errors='ignore')
    return data


class Parser(HTMLParser):
    def __init__(self, url):
        self.title = None
        self.rec = False
        HTMLParser.__init__(self)
        # Install the no-op error handler before feeding, so it is in effect while parsing
        self.error = error_callback
        try:
            self.feed(to_ascii(urlopen(url).read()))
        except urllib.error.HTTPError:
            return
        except urllib.error.URLError:
            return
        except ValueError:
            return

        self.rec = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.rec = True

    def handle_data(self, data):
        if self.rec:
            self.title = data

    def handle_endtag(self, tag):
        if tag == 'title':
            self.rec = False


def get_title(url):
    return Parser(url).title

print(get_title('http://www.google.com'))
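
The to_encoding() variation mentioned in the description could look roughly like the sketch below. The function name comes from the answer itself, while the encoding parameter name and its 'utf-8' default are assumptions. Unlike to_ascii(), this version always returns a str, which is what HTMLParser.feed() expects in Python 3.

```python
def to_encoding(data, encoding='utf-8'):
    """Return data as str, dropping anything the given encoding cannot represent."""
    if isinstance(data, bytes):
        # Raw bytes from the network: decode and skip undecodable bytes.
        return data.decode(encoding, errors='ignore')
    # Already text (or some other object): round-trip through the encoding
    # so that unrepresentable characters are removed.
    return str(data).encode(encoding, errors='ignore').decode(encoding)


print(to_encoding(b'caf\xc3\xa9'))
print(to_encoding('1994\u20132004', 'ascii'))
```

Inside Parser.__init__ the call would then become self.feed(to_encoding(urlopen(url).read())).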

Answer 11

Using lxml...

Get it from the page meta tag, marked up according to the Facebook Open Graph protocol:

import lxml.html
html_doc = lxml.html.parse(some_url)

t = html_doc.xpath('//meta[@property="og:title"]/@content')[0]

Or using .xpath with lxml:

t = html_doc.xpath(".//title")[0].text

Answer 12

In Python 3, we can call urlopen from urllib.request and BeautifulSoup from the bs4 library to fetch the page title.

Here we use the most efficient parser, "lxml".
