Trying to print the doctype declaration

Time: 2017-04-05 23:59:21

Tags: python regex web-scraping beautifulsoup

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.bbc.co.uk/iplayer/live/bbcone?area=london")
bsObj = BeautifulSoup(html, "html.parser")
version = bsObj.find(string = re.compile('DOCTYPE html'))

if version in bsObj:
    print("Yes")
else:
    print("No")

I know that the doctype declaration of "http://www.bbc.co.uk/iplayer/live/bbcone?area=london" is HTML5 (!DOCTYPE html), but when I run this script the output is "No". What am I doing wrong?

1 answer:

Answer 0 (score: 0)

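A doctype is an instruction to the browser rather than an HTML tag, so find and find_all cannot look it up the way they look up tags.

Apart from that, your regular expression cannot match, because the string value that BeautifulSoup stores for the doctype is just html, not DOCTYPE html.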
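To make that concrete, here is a minimal sketch that parses a small inline document instead of the live page. `SAMPLE_HTML` is a made-up stand-in, chosen so the example runs offline. It checks what the parsed doctype node actually contains and what the pattern from the question finds.

```python
import re
from bs4 import BeautifulSoup, Doctype

# Hypothetical sample markup standing in for the BBC page.
SAMPLE_HTML = (
    "<!DOCTYPE html><html><head><title>t</title></head>"
    "<body><p>hi</p></body></html>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The doctype is the first top-level node of the parsed document.
first_node = soup.contents[0]
is_doctype = isinstance(first_node, Doctype)
doctype_text = str(first_node)

# The pattern from the question searches for text the node never holds.
question_match = soup.find(string=re.compile("DOCTYPE html"))

print(is_doctype, repr(doctype_text), question_match)
```

Since the search returns None, the test `version in bsObj` in the question is really asking whether None is one of the document's top-level nodes, which is why the script prints "No".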

You can use the link mentioned by user kindall, or do it this way:

import requests
from bs4 import BeautifulSoup, Doctype

response = requests.get("http://www.bbc.co.uk/iplayer/live/bbcone?area=london")
soup = BeautifulSoup(response.content, "html.parser")
# Collect every string node whose text is exactly "html"
matches = soup.find_all(string="html")
# Keep the first match that is a Doctype node
doctype = next(item for item in matches if isinstance(item, Doctype))

print(doctype)

This will print:

html
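The snippet above raises StopIteration when the page has no doctype, and it only finds a doctype whose text is exactly html. A slightly more defensive variant is sketched below. The helper name `get_doctype` is hypothetical, and the sketch assumes the doctype, when present, sits among the document's top-level nodes. It uses inline markup so it can be run without network access.

```python
from bs4 import BeautifulSoup, Doctype


def get_doctype(markup):
    """Return the doctype text of `markup`, or None if it has none."""
    soup = BeautifulSoup(markup, "html.parser")
    # Scan only the top-level nodes and stop at the first Doctype.
    node = next((n for n in soup.contents if isinstance(n, Doctype)), None)
    return None if node is None else str(node)


with_doctype = get_doctype("<!DOCTYPE html><html><body></body></html>")
without_doctype = get_doctype("<html><body></body></html>")

print(with_doctype)
print(without_doctype)
```

To apply this to the page from the question, pass `requests.get(url).content` as `markup`. Because the check is by node type rather than by text, it also picks up longer declarations such as the HTML 4.01 ones.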