Question

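I am trying to get all the URLs on a website using Python. At the moment I just copy the site's HTML into my Python program and then use code to extract all the URLs. Is there a way to do this directly from the web, without having to copy the entire HTML?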

Answer 1

You can simply use a combination of requests and BeautifulSoup.

First, make an HTTP request with requests to get the HTML content. You get it back as a Python string, which you can manipulate as you like.
Then feed the HTML string to BeautifulSoup, which does all the work of building the DOM, and extract the <a> elements to get all the URLs.

Here is an example of how to get all the links from StackOverflow:

import requests
from bs4 import BeautifulSoup, SoupStrainer

response = requests.get('http://stackoverflow.com')
html_str = response.text

# Parse only the <a> tags. bs4 names this argument parse_only;
# parseOnlyThese is the older BeautifulSoup 3 spelling.
bs = BeautifulSoup(html_str, 'html.parser', parse_only=SoupStrainer('a'))

# href=True skips anchors that have no href attribute
for a_element in bs.find_all('a', href=True):
    print(a_element['href'])

Sample output:

/questions/tagged/facebook-javascript-sdk
/questions/31743507/facebook-app-request-dialog-keep-loading-on-mobile-after-fb-login-called
/users/3545752/user3545752
/questions/31743506/get-nuspec-file-for-existing-nuget-package
/questions/tagged/nuget
...

Answer 2

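If you are using Python 2, the simplest option is probably urllib.urlopen; if you are using Python 3, it is probably urllib.request.urlopen (you have to import urllib or import urllib.request first, of course). Either way you get a file-like object from which you can read the HTML document (i.e. f.read()).

Python 2 example: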

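import urllib

# urlopen lives in the urllib module, so it has to be called as urllib.urlopen
f = urllib.urlopen("http://stackoverflow.com")

# Read the whole response body, then release the connection
http_document = f.read()
f.close()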

The good news is that you seem to have already done the hard part, which is parsing the HTML document for links.
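For Python 3, one detail is worth illustrating: read() returns bytes rather than str, so the document usually has to be decoded before it is searched for links. The following is a minimal sketch, not a definitive implementation; the helper names fetch_html and decode_body are made up for this example, and falling back to UTF-8 when the server declares no charset is an assumption.

```python
import urllib.request


def decode_body(raw, charset=None):
    """Decode a response body; assumes UTF-8 when no charset is declared."""
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_html(url):
    """Download url and return its body as a str (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        # The charset parameter of the Content-Type header, or None
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)
```

Calling fetch_html("http://stackoverflow.com") would then give you the same kind of string you have been pasting into your program by hand.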

Answer 3

In Python 2, you can use urllib2.urlopen:

import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()

In Python 3, you can use urllib.request.urlopen:

import urllib.request
with urllib.request.urlopen('http://python.org/') as response:
    html = response.read()

If you have to perform more complex tasks such as authentication or passing parameters, I suggest you take a look at the requests library.
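To make "passing parameters" concrete, here is a small sketch of what it means for a GET request: the parameters are URL-encoded and appended as a query string. It uses only the standard library; requests does the equivalent for you through the params argument of requests.get. The helper name build_url and the parameter names q and page are hypothetical, and the sketch assumes the base URL has no query string yet.

```python
from urllib.parse import urlencode


def build_url(base_url, params):
    """Append params to base_url as a query string.

    Assumes base_url does not already contain a '?' (illustrative helper).
    """
    return base_url + "?" + urlencode(params)


# Hypothetical parameters, for illustration only
url = build_url("http://python.org/", {"q": "urllib request", "page": 2})
print(url)  # http://python.org/?q=urllib+request&page=2
```

The resulting URL can be passed to urllib.request.urlopen as usual. Authentication is where hand-rolled code gets tedious, which is the point at which requests becomes the more comfortable choice.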

Answer 4

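You may want to use the bs4 (BeautifulSoup) library.

Beautiful Soup is a Python library for pulling data out of HTML and XML files.

You can install bs4 from the command line with the following command:

pip install BeautifulSoup4

The following Python 2 example then prints every link on the page as an absolute URL: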

import urllib2
import urlparse
from bs4 import BeautifulSoup

url = "http://www.google.com"
response = urllib2.urlopen(url)
content = response.read()

soup = BeautifulSoup(content, "html.parser")
for link in soup.find_all('a', href=True):
    print urlparse.urljoin(url, link['href'])
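The code above is Python 2 only (urllib2, urlparse, and the print statement). As a rough Python 3 counterpart, here is a sketch that does the same job with the standard library's html.parser instead of bs4, so it is a swapped-in technique rather than a port of the bs4 call. The names LinkCollector and extract_links are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for a bare attribute such as <a href>
            if name == "href" and value:
                # Turn relative paths into absolute URLs
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets found in html as absolute URLs."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Here html is the decoded page source as a str, for example the result of urllib.request.urlopen(url).read().decode("utf-8"), and base_url is the address it was fetched from. If you already have bs4 installed, soup.find_all('a', href=True) combined with urllib.parse.urljoin works the same way in Python 3.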
