Extract links from a site and follow one of them, in Python 2.7

Date: 2016-04-20 11:01:35

Tags: python loops url beautifulsoup

I want to create a piece of code that works like this: you give it a URL, it looks at how many links are on that page, you pick one, it follows it and looks at the new page, you pick a link again, and so on.

I have a piece of code that opens a webpage, searches it for links and builds a list of them:

import urllib
from bs4 import BeautifulSoup
list_links = []
page = raw_input('enter an url')
url = urllib.urlopen(page).read()
html = BeautifulSoup(url, 'html.parser')
for link in html.find_all('a'):
    link = link.get('href')
    list_links.append(link)
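For reference, the href extraction above can be exercised on an inline HTML snippet (the snippet itself is made up). Note that `find_all('a')` also returns anchors without an `href` attribute, which show up as `None` in the list:

```python
from bs4 import BeautifulSoup

snippet = '<a href="/a">A</a> <a>no href</a> <a href="http://x.test/b">B</a>'
soup = BeautifulSoup(snippet, 'html.parser')
hrefs = [a.get('href') for a in soup.find_all('a')]
print(hrefs)  # the anchor without an href yields None
```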

Next, I want the user to decide which link to follow, so I have this:

link_number = len(list_links)
print 'enter a number between 0 and', (link_number)
number = raw_input('')

for number in number:
    if int(number) < 0 or int(number) > link_number:
        print "The End."
        break
    else:
        continue

url_2 = urllib.urlopen(list_links[int(number)]).read()

At this point my code crashes.
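The likely culprit is `for number in number:` above: it iterates over the *characters* of the input string, so a two-digit entry like "12" is checked digit by digit rather than as the number 12, and the bounds test is also off by one (valid indices run from 0 to len - 1). A minimal sketch of the intended range check (the helper name is mine, not from the question):

```python
def validate_choice(text, link_count):
    """Return the chosen index, or None when the entry is out of range."""
    idx = int(text)            # convert the whole string once, not per digit
    if 0 <= idx < link_count:  # valid list indices are 0 .. link_count - 1
        return idx
    return None                # caller treats this as "stop"

print(validate_choice('12', 20))
print(validate_choice('25', 20))
```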

Ideally, I would like an endless process (which the user can stop by entering an out-of-range number): open a page -> count its links -> pick one -> follow that link and open the new page -> count its links ...
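That open → count → choose → follow cycle can be sketched with the page-fetching step injected as a function, so the control flow is visible without any network access (`browse`, `fetch_links`, `choose` and the toy link map are all illustrative names, not from the question):

```python
def browse(fetch_links, start_url, choose):
    """Follow links until choose() returns an out-of-range index."""
    url = start_url
    while True:
        links = fetch_links(url)       # open the page -> list its links
        idx = choose(len(links))       # the user picks one
        if not 0 <= idx < len(links):  # a "wrong number" ends the loop
            return url
        url = links[idx]               # follow it and repeat

# Toy link map standing in for real pages:
site = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}
picks = iter([1, 0, 99])  # follow a -> c -> a, then stop with 99
print(browse(lambda u: site[u], 'a', lambda n: next(picks)))
```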

Can anyone help me?

2 Answers:

Answer 0 (score: 0)

You can try this (sorry if it isn't very pretty, I wrote it in a bit of a hurry):

import requests, random
from bs4 import BeautifulSoup as BS
from time import sleep


def main(url):
    content = scraping_call(url)
    if not content:
        print "Couldn't get html..."
        return
    else:
        links_list = []
        soup  = BS(content, 'html5lib')
        for link in soup.findAll('a'):
            try:
                links_list.append(link['href'])
            except KeyError:
                continue

        chosen_link_index = int(raw_input("Enter a number between 0 and %d: " % (len(links_list) - 1)))
        if not 0 <= chosen_link_index < len(links_list):
            raise ValueError('Number must be between 0 and %d' % (len(links_list) - 1))
            # script will crash here.
            # If you want the user to try again, you can
            # set up a number of attempts, like in scraping_call()
        else:
            #if user wants to stop the infinite loop 
            next_step = raw_input('Continue or exit? (Y/N) ') or 'Y'
            # default is 'Y', so to continue just press Enter
            if next_step.lower() == 'y':
                main(links_list[chosen_link_index])
            else:
                return



def scraping_call(url):
    attempt = 1
    while attempt < 6:
        try:
            page = requests.get(url)
            if page.status_code == 200:
                result = page.content
            else:
                result = ''
        except Exception,e:
            result = ''
            print 'Failed attempt (',attempt,'):', e
            attempt += 1
            sleep(random.randint(2,4))
            continue
        return result


if __name__ == '__main__':
    main('enter the starting URL here')
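The retry logic in `scraping_call` can be isolated into a small sketch (Python 3 syntax here for brevity; the sleep between attempts is dropped so it runs instantly, and the network call is injected as `fetch` so it can be exercised offline — `fetch_with_retries` and `flaky` are illustrative names):

```python
def fetch_with_retries(fetch, url, attempts=5):
    """Call fetch(url), retrying on any exception; '' signals total failure."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as e:
            print('Failed attempt (%d): %s' % (attempt, e))
    return ''

# A fetch stub that fails twice before succeeding:
calls = {'n': 0}
def flaky(url):
    calls['n'] += 1
    if calls['n'] < 3:
        raise IOError('temporary failure')
    return '<html>ok</html>'

print(fetch_with_retries(flaky, 'http://example.invalid/'))
```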

Answer 1 (score: 0)

Some links in a page may appear as relative addresses, and that needs to be taken into account. This should do the trick; it works on Python 3.4.

from urllib.request import urlopen
from urllib.parse import urljoin, urlsplit
from bs4 import BeautifulSoup

addr = input('enter an initial url: ')

while True:
    html = BeautifulSoup(urlopen(addr).read(), 'html.parser')
    list_links = []
    num = 0
    for link in html.find_all('a'):
        url = link.get('href')
        if not url:  # skip anchors without an href attribute
            continue
        if not urlsplit(url).netloc:
            url = urljoin(addr, url)
        if urlsplit(url).scheme in ['http', 'https']:
            print("%d : %s " % (num, str(url)))
            list_links.append(url)
            num += 1

    idx = int(input("enter an index between 0 and %d: " % (len(list_links) - 1)))
    if not 0 <= idx < len(list_links):
        raise ValueError('Number must be between 0 and %d: ' % len(list_links))
    addr = list_links[idx]
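The `urljoin`/`urlsplit` normalization this answer relies on can be seen in isolation (the URLs are made-up examples): relative hrefs are resolved against the current page, and non-http(s) schemes such as `mailto:` are what the scheme check filters out:

```python
from urllib.parse import urljoin, urlsplit

base = 'https://example.com/docs/index.html'
print(urljoin(base, 'page2.html'))      # resolved against the /docs/ directory
print(urljoin(base, '/about'))          # root-relative path replaces the whole path
print(urlsplit('mailto:x@y.z').scheme)  # not 'http'/'https', so it is skipped
```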