Following links in HTML with BeautifulSoup

Date: 2017-02-08 14:15:50

Tags: python python-3.x beautifulsoup

I'm taking a course that asks me to parse this page with BeautifulSoup: http://python-data.dr-chuck.net/known_by_Fikret.html

The instructions are: find the link at position 3 (the first name is position 1), follow that link, and repeat the process 4 times. The answer is the last name you retrieve.

Here is my code so far:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import re

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

count = int(input('Enter count: '))
pos = int(input('Enter position: ')) - 1
urllist = list()
taglist = list()

tags = soup('a')

for i in range(count):
    for tag in tags:
        taglist.append(tag)
    url = taglist[pos].get('href', None)
    print('Retrieving: ', url)
    urllist.append(url)
print('Last URL: ', urllist[-1])

Here is my output:

Retrieving:  http://python-data.dr-chuck.net/known_by_Fikret.html 
Retrieving:  http://python-data.dr-chuck.net/known_by_Montgomery.html
Retrieving:  http://python-data.dr-chuck.net/known_by_Montgomery.html
Retrieving:  http://python-data.dr-chuck.net/known_by_Montgomery.html
Retrieving:  http://python-data.dr-chuck.net/known_by_Montgomery.html
Last URL:  http://python-data.dr-chuck.net/known_by_Montgomery.html

Here is the output I'm supposed to get:

Retrieving: http://python-data.dr-chuck.net/known_by_Fikret.html
Retrieving: http://python-data.dr-chuck.net/known_by_Montgomery.html
Retrieving: http://python-data.dr-chuck.net/known_by_Mhairade.html
Retrieving: http://python-data.dr-chuck.net/known_by_Butchi.html
Retrieving: http://python-data.dr-chuck.net/known_by_Anayah.html
Last URL:  http://python-data.dr-chuck.net/known_by_Anayah.html

I've been working on this for a while, but I still can't get the code to loop correctly. I'm new to coding and am just looking for some help to point me in the right direction. Thanks.

8 Answers:

Answer 0 (score: 1)

Try it this way:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

url=input("Enter url:")

count=int(input('Enter count:'))
pos=int(input('Enter position:'))-1

urllist=list()

for i in range(count):
    html=urllib.request.urlopen(url)
    soup=BeautifulSoup(html,'html.parser')
    tags=soup('a')
    print('Retrieving:', url)
    taglist=list()
    for tag in tags:
        y=tag.get('href',None)
        taglist.append(y)

    url=taglist[pos]

    urllist.append(url)

print("Last Url:",urllist[-2])

Answer 1 (score: 0)

You keep taking the link at the same pos position on every pass. Use the i loop counter as an offset, replacing:

url = taglist[pos].get('href', None)

with:

url = taglist[pos + i].get('href', None)
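Applied offline to a made-up page (the names and filenames below are hypothetical, not from the course data), this shows what a fixed index versus the pos + i offset selects from a single page's tag list:

```python
# Offline sketch: what `tags[pos + i]` selects when the tag list
# comes from a single page. The HTML is a made-up stand-in for a
# course page; the names are hypothetical.
from bs4 import BeautifulSoup

html = """
<ul>
<li><a href="known_by_Ada.html">Ada</a></li>
<li><a href="known_by_Bob.html">Bob</a></li>
<li><a href="known_by_Cem.html">Cem</a></li>
<li><a href="known_by_Dia.html">Dia</a></li>
<li><a href="known_by_Eli.html">Eli</a></li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')
tags = soup('a')
pos = 1  # second link, zero-based

# With a fixed index you get the same link on every pass...
same = [tags[pos].get('href') for _ in range(3)]
# ...with the i offset you walk forward through the SAME page's links.
offset = [tags[pos + i].get('href') for i in range(3)]

print(same)    # three copies of known_by_Bob.html
print(offset)  # Bob, Cem, Dia
```

Note that this only steps through different links on the first page; the assignment's expected output requires re-fetching each followed page, as the other answers do.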

Answer 2 (score: 0)

The reason you are not getting the right answer is this: you never open the links you find.

After finding the right URL on the first page, you have to open the URL you found with urllib.request.urlopen(url).read() and look for the new link there. You have to repeat this for each hop; I suggest using a while loop.

This code does the trick:

import urllib.request
from bs4 import BeautifulSoup

url = 'http://python-data.dr-chuck.net/known_by_Fikret.html'
count = 5
pos = 2
urllist = []
taglist = []

connections = 0
while connections < 5:  # you need to connect five times
    taglist = []
    print('Retrieving: ', url)
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')

    for tag in tags:
        taglist.append(tag)

    url = taglist[pos].get('href', None)
    urllist.append(url)

    connections = connections + 1  
print ("last url:", url)

Answer 3 (score: 0)

import urllib.request
from bs4 import BeautifulSoup

def get_html(url):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    return soup

url = input('Enter - ')
count = int(input('Enter count: '))
pos = int(input('Enter position: ')) - 1

urllist = list()

for i in range(count):
    taglist = list()

    for tag in get_html(url)('a'): # Needed to update your variable to new url html
        taglist.append(tag)

    url = taglist[pos].get('href', None) # You grabbed url but never updated your tags variable.

    print('Retrieving: ', url)
    urllist.append(url)

print('Last URL: ', urllist[-1])

Answer 4 (score: 0)

Try this:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

def parse(url):
    count=0
    while count<7:
        html = urllib.request.urlopen(url, context=ctx).read()
        soup = BeautifulSoup(html, 'html.parser')
        list1=list()
        tags = soup('a')
        for tag in tags:
            list1.append(tag.get('href', None))
        url=list1[17]
        count+=1
        print('Retrieving:', url)

print (parse('http://py4e-data.dr-chuck.net/known_by_Lorenz.html'))    

Here is my output:

Retrieving: http://py4e-data.dr-chuck.net/known_by_Cadyn.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Phebe.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Cullen.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Alessandro.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Gurveer.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Anureet.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Sandie.html
None


Answer 6 (score: 0)

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl


# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

urllist = list()
url = input('Enter - ')
count = int(input('Enter count: '))
pos = int(input('Enter position: ')) - 1

for i in range(count):
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags=soup('a')
    url = tags[pos].get('href', None) 
    print('Retrieving: ', url)
    urllist.append(url)


print('Last URL: ', urllist[-1])

Answer 7 (score: 0)

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
url=input("Enter url:")
count=int(input('Enter count:'))
pos=int(input('Enter position:'))-1
urllist=list()
for i in range(count):
    html=urllib.request.urlopen(url)
    soup=BeautifulSoup(html,'html.parser')
    tags=soup('a')
    print('Retrieving:', url)
    taglist=list()
    for tag in tags:
        y=tag.get('href',None)
        taglist.append(y)
    url=taglist[pos]
    urllist.append(url)
print("Last Url:", urllist[-1])