How do I get all the application links on a page?

Time: 2013-08-28 05:47:39

Tags: python-2.7 beautifulsoup

I have this code:

from bs4 import BeautifulSoup
import urllib

url = 'http://www.brothersoft.com/windows/mp3_audio/midi_tools/'
html = urllib.urlopen(url)
soup = BeautifulSoup(html)

for a in soup.select('div.freeText dl a[href]'):
    print "http://www.borthersoft.com"+a['href'].encode('utf-8','replace')

This is what I get:

http://www.borthersoft.com/synthfont-159403.html
http://www.borthersoft.com/midi-maker-23747.html
http://www.borthersoft.com/keyboard-music-22890.html
http://www.borthersoft.com/mp3-editor-for-free-227857.html
http://www.borthersoft.com/midipiano---midi-file-player-recorder-61384.html
http://www.borthersoft.com/notation-composer-32499.html
http://www.borthersoft.com/general-midi-keyboard-165831.html
http://www.borthersoft.com/digital-music-mentor-31262.html
http://www.borthersoft.com/unisyn-250033.html
http://www.borthersoft.com/midi-maestro-13002.html
http://www.borthersoft.com/music-editor-free-139151.html
http://www.borthersoft.com/midi-converter-studio-46419.html
http://www.borthersoft.com/virtual-piano-65133.html
http://www.borthersoft.com/yamaha-9000-drumkit-282701.html
http://www.borthersoft.com/virtual-midi-keyboard-260919.html
http://www.borthersoft.com/anvil-studio-6269.html
http://www.borthersoft.com/midicutter-258103.html
http://www.borthersoft.com/softick-audio-gateway-55913.html
http://www.borthersoft.com/ipmidi-161641.html
http://www.borthersoft.com/d.accord-keyboard-chord-dictionary-28598.html

There should be 526 application links printed, but I only get twenty. Is my code missing something?

1 answer:

Answer 0 (score: 1)

There are only 20 application links on each page.

You have to iterate over all the pages to get all the links:

from bs4 import BeautifulSoup
import urllib

for page in range(1, 27 + 1):  # currently there are 27 pages
    # Each listing page lives at <category URL>/<page number>.html
    url = 'http://www.brothersoft.com/windows/mp3_audio/midi_tools/{}.html'.format(page)
    html = urllib.urlopen(url)
    soup = BeautifulSoup(html)

    # Application links are the anchors inside a dl within div.freeText
    for a in soup.select('div.freeText dl a[href]'):
        print "http://www.brothersoft.com" + a['href'].encode('utf-8', 'replace')
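
As a side note, the host name is typed by hand in the print statement, which is how the misspelled "borthersoft" ended up in the question's output. A minimal sketch of building both the page URLs and the absolute links with urljoin instead, assuming Python 3 (on Python 2.7 the import would be `from urlparse import urljoin`). The helper names `page_urls` and `absolute_links` are hypothetical, and the page count of 27 is just the figure quoted above:

```python
from urllib.parse import urljoin

# Category URL taken from the question.
BASE_URL = 'http://www.brothersoft.com/windows/mp3_audio/midi_tools/'


def page_urls(base_url, last_page):
    """Return the URL of every listing page, numbered 1..last_page."""
    return [urljoin(base_url, '{}.html'.format(n))
            for n in range(1, last_page + 1)]


def absolute_links(page_url, hrefs):
    """Resolve each href against the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


if __name__ == '__main__':
    # 27 is the page count mentioned in the answer; it may have changed.
    pages = page_urls(BASE_URL, 27)
    print(pages[0])
    # hrefs would normally come from soup.select('div.freeText dl a[href]')
    print(absolute_links(pages[0], ['/synthfont-159403.html']))
```

Because urljoin takes the scheme and host from the page URL itself, a root-relative href such as /synthfont-159403.html resolves against the correct domain without the host being retyped.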