Beautiful Soup: descending links whose hrefs match a pattern

Date: 2016-12-10 21:30:56

Tags: python-2.7 beautifulsoup

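I know this is a very green question, but I am trying to descend the links in a website. I would like to be able to follow links of links of links, requiring the links at each stage to match some simple pattern. I have looked at a few tutorials on listing links, but have not seen anything on pattern matching of links or on descending them. Some help would be appreciated.

For example, in this case: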

from bs4 import BeautifulSoup
import urllib2

resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, from_encoding=resp.info().getparam('charset'))

for link in soup.find_all('a', href=True):
    print link['href']

Output:

/contact-gpsbasecamp.php
/privacy-policy.php
/terms-of-service.php
/
                National-Parks/map
/National-Historic-Parks
/National-Historic-Sites
/National-Monuments
/Other-NPS-Facilities
national-parks/Acadia_National_Park
national-parks/Arches_National_Park
national-parks/Badlands_National_Park
national-parks/Big_Bend_National_Park
national-parks/Biscayne_National_Park
national-parks/Black_Canyon_Of_The_Gunnison_National_Park
national-parks/Bryce_Canyon_National_Park
national-parks/Canyonlands_National_Park
national-parks/Capitol_Reef_National_Park
national-parks/Carlsbad_Caverns_National_Park
national-parks/Channel_Islands_National_Park
national-parks/Congaree_National_Park
national-parks/Crater_Lake_National_Park
national-parks/Cuyahoga_Valley_National_Park
national-parks/Death_Valley_National_Park
national-parks/Denali_National_Park_and_Preserve
national-parks/Dry_Tortugas_National_Park
national-parks/Everglades_National_Park
national-parks/Gates_Of_The_Arctic_National_Park_and_Preserve
national-parks/Glacier_Bay_National_Park_and_Preserve
national-parks/Glacier_National_Park
national-parks/Grand_Canyon_National_Park
national-parks/Grand_Teton_National_Park
national-parks/Great_Basin_National_Park
national-parks/Great_Smoky_Mountains_National_Park
national-parks/Guadalupe_Mountains_National_Park
national-parks/Haleakala_National_Park
national-parks/Hawaii_Volcanoes_National_Park
national-parks/Hot_Springs_National_Park
national-parks/Isle_Royale_National_Park
national-parks/Joshua_Tree_National_Park
national-parks/Katmai_National_Park_and_Preserve
national-parks/Kenai_Fjords_National_Park
national-parks/Kings_Mountain_National_Military_Park
national-parks/Kobuk_Valley_National_Park
national-parks/Lake_Clark_National_Park_and_Preserve
national-parks/Lassen_Volcanic_National_Park
national-parks/Mammoth_Cave_National_Park
national-parks/Mesa_Verde_National_Park
national-parks/Mount_Rainier_National_Park
national-parks/National_Park_of_American_Samoa
national-parks/National_Parks_of_New_York_Harbor
national-parks/North_Cascades_National_Park
national-parks/Olympic_National_Park
national-parks/Petrified_Forest_National_Park
national-parks/Redwood_National_and_State_Parks
national-parks/Rocky_Mountain_National_Park
national-parks/Saguaro_National_Park
national-parks/Sequoia_and_Kings_Canyon_National_Parks
national-parks/Shenandoah_National_Park
national-parks/Theodore_Roosevelt_National_Park
national-parks/Virgin_Islands_National_Park
national-parks/Voyageurs_National_Park
national-parks/Wind_Cave_National_Park
national-parks/Wolf_Trap_National_Park_for_the_Performing_Arts
national-parks/Wrangell_-_St_Elias_National_Park_and_Preserve
national-parks/Yellowstone_National_Park
national-parks/Yosemite_National_Park
national-parks/Zion_National_Park
http://www.gpsbasecamp.com
http://www.gpsbasecamp.com
/upload-gps-file.php
/download-gps-file.php
/national-parks
/state-parks


/mp3/index.php

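How do I follow all the links that contain "national-parks" and get information from the next level of links?

Thanks for your help!

2 Answers:

Answer 0: (score: 1)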

Method 1:

for link in soup.select('a[href^="national-parks"]'):
    print(link['href'])

Method 2:

import re
for link in soup.find_all('a', href=re.compile(r"^national-parks")):
    print(link['href'])

Both methods match hrefs that start with 'national-parks'.

Out:

national-parks/Acadia_National_Park
national-parks/Arches_National_Park
national-parks/Badlands_National_Park
national-parks/Big_Bend_National_Park
national-parks/Biscayne_National_Park
national-parks/Black_Canyon_Of_The_Gunnison_National_Park
national-parks/Bryce_Canyon_National_Park
national-parks/Canyonlands_National_Park
national-parks/Capitol_Reef_National_Park
national-parks/Carlsbad_Caverns_National_Park
national-parks/Channel_Islands_National_Park
national-parks/Congaree_National_Park
national-parks/Crater_Lake_National_Park
national-parks/Cuyahoga_Valley_National_Park
national-parks/Death_Valley_National_Park
national-parks/Denali_National_Park_and_Preserve
national-parks/Dry_Tortugas_National_Park
national-parks/Everglades_National_Park
national-parks/Gates_Of_The_Arctic_National_Park_and_Preserve
national-parks/Glacier_Bay_National_Park_and_Preserve
national-parks/Glacier_National_Park
national-parks/Grand_Canyon_National_Park
national-parks/Grand_Teton_National_Park
national-parks/Great_Basin_National_Park
national-parks/Great_Smoky_Mountains_National_Park
national-parks/Guadalupe_Mountains_National_Park
national-parks/Haleakala_National_Park
national-parks/Hawaii_Volcanoes_National_Park
national-parks/Hot_Springs_National_Park
national-parks/Isle_Royale_National_Park
national-parks/Joshua_Tree_National_Park
national-parks/Katmai_National_Park_and_Preserve
national-parks/Kenai_Fjords_National_Park
national-parks/Kings_Mountain_National_Military_Park
national-parks/Kobuk_Valley_National_Park
national-parks/Lake_Clark_National_Park_and_Preserve
national-parks/Lassen_Volcanic_National_Park
national-parks/Mammoth_Cave_National_Park
national-parks/Mesa_Verde_National_Park
national-parks/Mount_Rainier_National_Park
national-parks/National_Park_of_American_Samoa
national-parks/National_Parks_of_New_York_Harbor
national-parks/North_Cascades_National_Park
national-parks/Olympic_National_Park
national-parks/Petrified_Forest_National_Park
national-parks/Redwood_National_and_State_Parks
national-parks/Rocky_Mountain_National_Park
national-parks/Saguaro_National_Park
national-parks/Sequoia_and_Kings_Canyon_National_Parks
national-parks/Shenandoah_National_Park
national-parks/Theodore_Roosevelt_National_Park
national-parks/Virgin_Islands_National_Park
national-parks/Voyageurs_National_Park
national-parks/Wind_Cave_National_Park
national-parks/Wolf_Trap_National_Park_for_the_Performing_Arts
national-parks/Wrangell_-_St_Elias_National_Park_and_Preserve
national-parks/Yellowstone_National_Park
national-parks/Yosemite_National_Park
national-parks/Zion_National_Park
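
To try the two methods without hitting the network, here is a minimal sketch that runs them on a small inline snippet. The markup is made up to mimic a few hrefs from the output above, and passing "html.parser" is my own choice of parser, not something from the question.

```python
import re
from bs4 import BeautifulSoup

# Made-up sample markup imitating a few hrefs from the real page.
html = """
<a href="/national-parks">All national parks</a>
<a href="/state-parks">State parks</a>
<a href="national-parks/Acadia_National_Park">Acadia</a>
<a href="national-parks/Zion_National_Park">Zion</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Method 1: CSS attribute selector; ^= means "starts with".
by_css = [a["href"] for a in soup.select('a[href^="national-parks"]')]

# Method 2: regular expression anchored at the start of the href.
by_regex = [a["href"] for a in soup.find_all("a", href=re.compile(r"^national-parks"))]

print(by_css)
print(by_regex)
```

Both lists should contain only the two park pages. "/national-parks" is skipped because it starts with a slash, so an unanchored pattern would be needed if that link were wanted too.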

Answer 1: (score: 0)

I think this is the feature you are looking for: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments

from bs4 import BeautifulSoup
import urllib2
import re

resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, from_encoding=resp.info().getparam('charset'))

# Collect the href of every tag whose href contains "national-parks"
nat_parks_links = [link['href'] for link in soup.find_all(href=re.compile("national-parks"))]

You can then visit each of those links in turn. (I have not actually tested the code above.)
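
Visiting each link in turn could look roughly like the sketch below. It is only an illustration under a few assumptions: `matching_links` and `descend` are hypothetical helper names, the page fetcher is passed in as a function so the logic can be tried without network access, and relative hrefs are resolved against the page they came from with `urljoin`. One pattern is given per level, as the question asks.

```python
import re
from bs4 import BeautifulSoup

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2.7


def matching_links(html, page_url, pattern):
    """Return absolute URLs of the links on a page whose href matches pattern."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for a in soup.find_all("a", href=re.compile(pattern)):
        # Resolve relative hrefs against the URL of the page they appear on.
        url = urljoin(page_url, a["href"].strip())
        if url not in urls:  # keep the first occurrence, drop repeats
            urls.append(url)
    return urls


def descend(start_url, patterns, fetch):
    """Follow links level by level; patterns[i] filters the links at level i.

    fetch is any callable that takes a URL and returns that page's HTML.
    Returns one list of URLs per level.
    """
    levels = []
    current = [start_url]
    seen = set(current)
    for pattern in patterns:
        found = []
        for page_url in current:
            for url in matching_links(fetch(page_url), page_url, pattern):
                if url not in seen:  # never queue the same page twice
                    seen.add(url)
                    found.append(url)
        levels.append(found)
        current = found
    return levels
```

On Python 2.7 the fetcher could be something like `lambda url: urllib2.urlopen(url).read()`, and the first pattern could be `r"^national-parks"`. The pattern for the second level depends on what the park pages link to, which is not shown here, so it would have to be chosen after looking at one of those pages.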