Question

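I need to parse a URL to get a list of URLs that link to detail pages. Then, from each of those pages, I need to get all the details. I need to do it this way because the detail page URLs are not incremented regularly and they change, while the event list page stays the same.

Basically: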

example.com/events/
    <a href="http://example.com/events/1">Event 1</a>
    <a href="http://example.com/events/2">Event 2</a>

example.com/events/1
    ...some detail stuff I need

example.com/events/2
    ...some detail stuff I need

Answer 1

# Python 2 with BeautifulSoup 3
import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen('http://yahoo.com').read()
soup = BeautifulSoup(page)
# Print the target of every anchor that has an href attribute
for anchor in soup.findAll('a', href=True):
    print anchor['href']

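This gives you the list of URLs. You can now iterate over those URLs and parse the data on each page.

For example: inner_div = soup.findAll("div", {"id": "y-shade"}). The BeautifulSoup tutorial covers the rest.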
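The two steps described above (collect the links from the listing page, then visit each one) could be sketched roughly as follows in Python 3 with Beautiful Soup 4. This is only a sketch under assumptions: the function names `fetch`, `detail_urls` and `scrape_details` are made up for illustration, and the `details` element id is a placeholder for whatever markup the real detail pages use.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch(url):
    """Download a page and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def detail_urls(listing_html, listing_url):
    """Return the unique URLs linked from the listing that sit below it."""
    soup = BeautifulSoup(listing_html, "html.parser")
    urls = []
    for anchor in soup.find_all("a", href=True):
        url = urljoin(listing_url, anchor["href"])  # resolve relative links
        if url.startswith(listing_url) and url != listing_url and url not in urls:
            urls.append(url)
    return urls


def scrape_details(listing_url, fetch=fetch):
    """Map each detail URL to the text of its (hypothetical) #details element."""
    results = {}
    for url in detail_urls(fetch(listing_url), listing_url):
        soup = BeautifulSoup(fetch(url), "html.parser")
        node = soup.find(id="details")  # assumed id; adjust to the real page
        results[url] = node.get_text(" ", strip=True) if node else None
    return results
```

Calling `scrape_details('http://example.com/events/')` would then return a dict keyed by detail URL. Passing a different `fetch` callable makes it possible to try the logic against saved HTML without touching the network.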

Answer 2

For the next group of people who come across this: BeautifulSoup has been upgraded to v4, since v3 is no longer being updated.

$ easy_install beautifulsoup4

$ pip install beautifulsoup4

To use it in Python...

from bs4 import BeautifulSoup
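As a small illustration of what changes in v4: the class is imported from the `bs4` package, `find_all` is the v4 spelling of `findAll`, and naming the parser explicitly avoids the "no parser was explicitly specified" warning. The HTML string below is invented for the example.

```python
from bs4 import BeautifulSoup

# Made-up markup: one anchor with an href, one without
html = '<p><a href="/events/1">Event 1</a> <a name="top">no href</a></p>'

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only anchors that actually carry an href attribute
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)
```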

Answer 3

Use urllib2 to get the page, then use Beautiful Soup to get the list of links. You could also try scraperwiki.com.

Edit:

Recent discovery: using BeautifulSoup through lxml with

from lxml.html.soupparser import fromstring

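is miles better than BeautifulSoup on its own. It lets you do dom.cssselect('your selector'), which is a lifesaver. Just make sure you have a good version of BeautifulSoup installed. 3.2.1 works a treat.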

dom = fromstring('<html... ...')
navigation_links = [a.get('href') for a in dom.cssselect('#navigation a')]
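If installing lxml is not an option, Beautiful Soup 4 has CSS selector support of its own through `select()`. Note that this is a different route from the soupparser approach above, not the same API; the markup below is invented for the example.

```python
from bs4 import BeautifulSoup

# Made-up page with a navigation block and an unrelated footer link
html = """
<div id="navigation"><a href="/events/1">Event 1</a><a href="/events/2">Event 2</a></div>
<div id="footer"><a href="/about">About</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Same selector as above: anchors that are descendants of #navigation
navigation_links = [a.get("href") for a in soup.select("#navigation a")]
print(navigation_links)
```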

Answer 4

Full Python 3 example

Packages

# urllib ships with Python 3's standard library, so only Beautiful Soup needs installing
# pip3 install beautifulsoup4

Example:

import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen('https://www.wikipedia.org/') as f:
    data = f.read().decode('utf-8')

d = BeautifulSoup(data, 'html.parser')

print(d.title.string)

The above should print 'Wikipedia'.
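One caveat with decoding as UTF-8 by hand is that not every page is UTF-8. Beautiful Soup also accepts raw bytes and will try to work out the encoding itself, for instance from a `<meta charset>` declaration. A minimal sketch, using an invented byte string in place of a downloaded page:

```python
from bs4 import BeautifulSoup

# Invented ISO-8859-1 page; \xe9 is "é" in that encoding and is not valid UTF-8
raw = b'<html><head><meta charset="iso-8859-1"><title>Caf\xe9 events</title></head></html>'

# Pass the bytes straight in and let Beautiful Soup pick the encoding
soup = BeautifulSoup(raw, "html.parser")
title = soup.title.string
print(title)
```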

Beautiful Soup: parse a URL to get another URL's data
