Question

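I'm trying to write a Python program that reads all the data from a web page and appends the contents of any heading tags, <h1> through <h6>, to a list. So far I'm just trying to fetch the site's data, which has proven difficult.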

Edit: This is for a class. Sadly, we are not allowed to use libraries that do not come pre-installed with Python.

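Edit 2: Thanks for all the tips. The program now successfully reads the HTML of the given website. Does anyone have a suggestion for searching webPage for specific strings (the <h> tags)?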

import urllib
from urllib.request import urlopen

#example URL that includes an <h> tag: http://www.hobo-web.co.uk/headers/
userAddress = input("Enter a website URL: ")

webPage = urllib.request.urlopen(userAddress)

print (webPage.read())

webPage.close()

Answer 1

I suggest using the requests library.

import requests 

r = requests.get('http://www.hobo-web.co.uk/')
print(r.text)

See the documentation at http://docs.python-requests.org/en/latest/user/quickstart/

Answer 2

I assume you are using Python 3 to fetch the web page. It can be fetched with the following code:

import urllib
from urllib.request import urlopen

address = "http://www.hobo-web.co.uk/headers/"
webPage = urllib.request.urlopen(address)

print (webPage.read())

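To extract information from the web page, you can use BeautifulSoup, an excellent tool for this job. You can use it to extract tables, lists, and paragraphs, and you can also apply filters to pick out specific parts of the page.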

Install it from here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup

Answer 3

Have a look at the BeautifulSoup library. It is an API for parsing the DOM tree. You can do something like soup.find_all('h1'), which returns a list of all h1 elements.

Answer 4

It is better to use a with statement so that the connection is closed automatically. Here is an example:

import urllib.request

address = "http://www.hobo-web.co.uk/headers/"
# The with block closes the connection automatically when it exits.
with urllib.request.urlopen(address) as response:
    html = response.read()  # raw bytes of the page
    print(html)
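Since the question rules out third-party libraries, the headings could also be pulled out with html.parser from the standard library. The following is only a minimal sketch: HeadingCollector, extract_headings and the sample string are hypothetical names made up for illustration, and the parser expects a str, so the bytes returned by response.read() would need to be decoded first.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Collects the text inside every <h1>..<h6> element."""

    def __init__(self):
        super().__init__()
        self.headings = []    # finished heading strings
        self._current = None  # text pieces of the heading being read, or None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <H1> and <h1> both match.
        if tag in HEADING_TAGS:
            self._current = []

    def handle_data(self, data):
        # Keep text only while we are inside a heading element.
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._current is not None:
            self.headings.append("".join(self._current).strip())
            self._current = None


def extract_headings(html):
    """Return the text of all heading elements found in an HTML string."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings


# Hypothetical sample input, used instead of a live page.
sample = "<html><body><H1>Main <em>title</em></H1><p>text</p><h3> Section </h3></body></html>"
print(extract_headings(sample))
```

Text inside nested tags such as <em> is kept because the collector only stops recording when the heading's own closing tag arrives.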

Answer 5

Your webPage variable is a network response object. To actually work with the HTML content, call read() on it:

content = webPage.read()
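Note that read() returns bytes rather than text. A minimal sketch of decoding them with the charset the server declares is shown here; decode_body is a hypothetical helper, and the Message object is only a stand-in for webPage.headers (which is an email.message.Message subclass in Python 3) so the example runs without a network connection.

```python
from email.message import Message


def decode_body(raw, headers, default="utf-8"):
    """Decode response bytes using the charset from the Content-Type header."""
    # get_content_charset() returns None when no charset is declared.
    charset = headers.get_content_charset() or default
    return raw.decode(charset, errors="replace")


# Stand-in for webPage.headers (assumption made for this illustration).
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"

text = decode_body("<h1>Café</h1>".encode("latin-1"), headers)
print(text)
```

With a real response, the call would look like decode_body(content, webPage.headers).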

To get the contents of the heading tags, you can use the BeautifulSoup library:

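from bs4 import BeautifulSoup

# Reuse the bytes already read into content; a second webPage.read() returns b''.
charset = webPage.headers.get_content_charset()  # None if the server declares no charset
soup = BeautifulSoup(content, 'html.parser', from_encoding=charset)
# find_all returns a list of tags, so collect the text of each heading individually.
heads = [tag.get_text() for tag in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])]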

Now heads is a list of the contents of all the heading tags.

To learn more about the BeautifulSoup library, see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Extracting HTML information from a URL

5 answers: