I need some help with my class assignment. The goal is as follows: find the link at position 18 (the first name is 1). Follow that link. Repeat this process 7 times. The answer is the last name that you retrieve. Hint: the first character of the name of the last page you will load is: J. Link: http://py4e-data.dr-chuck.net/known_by_Shannon.html
I have written code for this task, but it seems to only work for the first item; for every site after the first one, the code breaks down. My idea was to get the HTML and append the URLs to a list, then pick the 18th item from that list, restart the whole loop with the new URL, and delete the old list. Repeat the process 7 times. I am seriously confused about where exactly my code goes wrong. Thanks in advance.
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
import re

term_counter = (0)
file = list()
regex = list()

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
for I in range(7) :
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    del file[:]
    file = list()
    for tag in tags :
        file.append(tag)
        print(tag.contents[0])
        url = tag.get('href')
        print (url)
    for items in range(17,18) :
        print(file[items])
I am new to this site and I am not sure whether this is the right place to ask Python questions; if not, please let me know and I will repost it in the right place.
Answer 0 (score: 0)
You can try the following approach:
import bs4
import requests

url = 'http://py4e-data.dr-chuck.net/known_by_Shannon.html'
for I in range(7) :
    session = requests.get(url)
    soup = bs4.BeautifulSoup(session.text, "html.parser")
    target_list = soup.findAll("li")[17]
    href = target_list.find("a").get("href")
    link_name = target_list.find("a").text
    if(I == 6):
        print('Solution is ' + link_name)
    else:
        print('Name is: ' + link_name)
        print('Going to web is: ' + href)
        url = href
The answer will be the link name printed last.
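One caveat: soup.findAll("li")[17] relies on every name link sitting in its own <li> element, which does hold for the py4e pages. Indexing the anchors directly makes that assumption unnecessary; a minimal sketch of the same loop (the variable names here are placeholders, not from the assignment):

import bs4
import requests

url = 'http://py4e-data.dr-chuck.net/known_by_Shannon.html'
anchor = None
for hop in range(7):
    soup = bs4.BeautifulSoup(requests.get(url).text, "html.parser")
    anchor = soup.find_all("a")[17]  # the 18th link, counting from 1
    url = anchor.get("href")
print('Solution is ' + anchor.text)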
Answer 1 (score: 0)
The problem is that although you print out the 18th entry of tags, you never set url to that value. Unfortunately, you also reassign url inside the for tag in tags loop, so you do not notice this mistake: in your code, url ends up set to the href of the last entry of tags, and that is the page the next round loads. If you print out url together with the actual I used in each loop round (uncomment the corresponding lines in the code below), you will see the following:
0 http://py4e-data.dr-chuck.net/known_by_Shannon.html
['Riya', 'Terri', 'Coban', 'Oswald', 'Codie', 'Arshjoyat', 'Carli', 'Aieecia', 'Ronnie', 'Yelena', 'Abid', 'Prithvi', 'Ellenor', 'Shayla', 'Chala', 'Nelson', 'Chaitanya', 'Stacey', 'Karis', 'Mariyah', 'Jamie', 'Breeanna', 'Kendall', 'Adelaide', 'Aimiee', 'Manwen', 'Dennys', 'Benjamyn', 'Reynelle', 'Jesuseun', 'Malik', 'Brigitte', 'Farah', 'Youcef', 'Ruqayah', 'Mili', 'Caitaidh', 'Raul', 'Katelyn', 'Yakup', 'Cohan', 'Lylakay', 'Dougray', 'Silvana', 'Roxanne', 'Tanchoma', 'Andie', 'Aarman', 'Kyalah', 'Tayyab', 'Malikah', 'Bo', 'Oona', 'Daniil', 'Wardah', 'Jessamy', 'Karly', 'Tala', 'Ilyaas', 'Maram', 'Ruaidhri', 'Donna', 'Liza', 'Aileigh', 'Muzzammil', 'Chi', 'Serafina', 'Abbas', 'Rhythm', 'Jonny', 'Niraj', 'Ciara', 'Kylen', 'Demmi', 'Christianna', 'Tanzina', 'Brianna', 'Kevyn', 'Hariot', 'Maisie', 'Naideen', 'Nicolas', 'Suvi', 'Areeb', 'Kiranpreet', 'Rachna', 'Umme', 'Caela', 'Miao', 'Tansy', 'Miah', 'Luciano', 'Karolina', 'Rivan', 'Cavan', 'Benn', 'Haydn', 'Zaina', 'Rafi', 'Ahmad']
<a href="http://py4e-data.dr-chuck.net/known_by_Stacey.html">Stacey</a>
1 http://py4e-data.dr-chuck.net/known_by_Ahmad.html
['Tilhi', 'Rachel', 'Latif', 'Deryn', 'Pawel', 'Anna', 'Blake', 'Brehme', 'Jo', 'Laurajane', 'Khayla', 'Declyan', 'Graidi', 'Foosiya', 'Nabeeha', 'Otilija', 'Dougal', 'Adeena', 'Alfie', 'Angali', 'Lilah', 'Saadah', 'Kelam', 'Kensey', 'Tabitha', 'Peregrine', 'Abdisalam', 'Presley', 'Allegria', 'Harish', 'Arshjoyat', 'Hussan', 'Sammy', 'Ama', 'Leydon', 'Anndra', 'Anselm', 'Logyne', 'Fion', 'Jacqui', 'Reggie', 'Mounia', 'Pedro', 'Hussain', 'Raina', 'Inka', 'Shaylee', 'Riya', 'Phebe', 'Uzayr', 'Isobella', 'Abdulkadir', 'Johndean', 'Charlotte', 'Moray', 'Saraah', 'Liana', 'Keane', 'Maros', 'Robi', 'Rowanna', 'Wesley', 'Maddox', 'Annica', 'Oluwabukunmi', 'Jiao', 'Nyomi', 'Hamish', 'Bushra', 'Marcia', 'Rimal', 'Kaceylee', 'Limo', 'Dela', 'Cal', 'Rhudi', 'Komal', 'Stevey', 'Amara', 'Nate', 'Roma', 'Fatou', 'Marykate', 'Abiya', 'Bay', 'Kati', 'Carter', 'Niraj', 'Maisum', 'Jaz', 'Coban', 'Harikrishna', 'Armani', 'Muir', 'Ilsa', 'Benjamyn', 'Russel', 'Emerson', 'Rehaan', 'Veronica']
<a href="http://py4e-data.dr-chuck.net/known_by_Adeena.html">Adeena</a>
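The loop-variable effect is easiest to see in isolation. A tiny sketch with made-up stand-in values (nothing to do with the real pages):

tags = ['link_01.html', 'link_02.html', 'link_18.html', 'link_99.html']  # stand-ins for <a> tags
url = None
for tag in tags:
    url = tag   # reassigned on every pass of the loop
print(url)      # prints 'link_99.html' - always the LAST entry, never the 18th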
To avoid this problem, you have to set url to the 18th entry before the next loop round:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
#import re

term_counter = (0)
file = list()
#regex = list()

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
for _I in range(7) :
    #print(_I, url)  # <- this prints out the _I value of the loop and the url used in this round
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    #print([item.contents[0] for item in tags])  # <- this prints out a list of all names on this page
    file = list()
    for tag in tags :
        file.append(tag)
        # this is the last url you used in your code for the next _I loop
        url = tag.get('href')
    # so we have to redefine url as the 18th entry in your list for the next _I loop round
    url = file[17].get("href")
    print("The next url we will use is {}".format(url))
Answer 2 (score: 0)
To me, a class problem like this begs for a recursive approach:
import bs4
import requests

url = 'http://py4e-data.dr-chuck.net/known_by_Shannon.html'

def get_url(url):
    page_response = requests.get(url)
    # note: with "import bs4" the class must be referenced as bs4.BeautifulSoup
    soup = bs4.BeautifulSoup(page_response.content, "html.parser")
    return soup.find_all('a')[17].get('href')

def routine(url, counter):
    if counter > 0:
        url = get_url(url)
        print(url)
        counter -= 1
        routine(url, counter)

routine(url, 7)
Output:
http://py4e-data.dr-chuck.net/known_by_Stacey.html
http://py4e-data.dr-chuck.net/known_by_Zoya.html
http://py4e-data.dr-chuck.net/known_by_Haiden.html
http://py4e-data.dr-chuck.net/known_by_Ayrton.html
http://py4e-data.dr-chuck.net/known_by_Leilan.html
http://py4e-data.dr-chuck.net/known_by_Thorben.html
http://py4e-data.dr-chuck.net/known_by_Jahy.html
This confirms that the name in the last URL starts with 'J'.
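The recursion is only 7 levels deep here, far below CPython's default recursion limit (sys.getrecursionlimit() typically returns 1000), so it is safe; for much longer chains the same routine can be written as a loop. A sketch of an equivalent iterative version (follow_links and the position parameter are my own naming, not part of the assignment):

import bs4
import requests

def follow_links(url, hops, position=18):
    # Follow the link at `position` (counting from 1) `hops` times in a row.
    for _ in range(hops):
        soup = bs4.BeautifulSoup(requests.get(url).content, "html.parser")
        url = soup.find_all('a')[position - 1].get('href')
        print(url)
    return url

follow_links('http://py4e-data.dr-chuck.net/known_by_Shannon.html', 7)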