I'm working on a course that requires me to use BeautifulSoup to parse this page: http://python-data.dr-chuck.net/known_by_Fikret.html
The instructions are: find the link at position 3 (the first name is 1). Follow that link. Repeat this process 4 times. The answer is the last name that you retrieve.
Here's the code I have so far:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import re

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
count = int(input('Enter count: '))
pos = int(input('Enter position: ')) - 1
urllist = list()
taglist = list()
tags = soup('a')
for i in range(count):
    for tag in tags:
        taglist.append(tag)
    url = taglist[pos].get('href', None)
    print('Retrieving: ', url)
    urllist.append(url)
print('Last URL: ', urllist[-1])
Here's my output:
Retrieving: http://python-data.dr-chuck.net/known_by_Fikret.html
Retrieving: http://python-data.dr-chuck.net/known_by_Montgomery.html
Retrieving: http://python-data.dr-chuck.net/known_by_Montgomery.html
Retrieving: http://python-data.dr-chuck.net/known_by_Montgomery.html
Retrieving: http://python-data.dr-chuck.net/known_by_Montgomery.html
Last URL: http://python-data.dr-chuck.net/known_by_Montgomery.html
Here's the output I'm supposed to get:
Retrieving: http://python-data.dr-chuck.net/known_by_Fikret.html
Retrieving: http://python-data.dr-chuck.net/known_by_Montgomery.html
Retrieving: http://python-data.dr-chuck.net/known_by_Mhairade.html
Retrieving: http://python-data.dr-chuck.net/known_by_Butchi.html
Retrieving: http://python-data.dr-chuck.net/known_by_Anayah.html
Last URL: http://python-data.dr-chuck.net/known_by_Anayah.html
I've been working on this for a while, but I still can't get the code to loop correctly. I'm new to coding and am just looking for some help to point me in the right direction. Thanks.
Answer 0 (score: 1)
Try it this way:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

url = input("Enter url:")
count = int(input('Enter count:'))
pos = int(input('Enter position:')) - 1
urllist = list()
for i in range(count):
    html = urllib.request.urlopen(url)
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    print('Retrieving:', url)
    taglist = list()
    for tag in tags:
        y = tag.get('href', None)
        taglist.append(y)
    url = taglist[pos]
    urllist.append(url)
print("Last Url:", urllist[-2])
Answer 1 (score: 0)
You're getting the link at the same pos index every time. Use the loop counter i as an offset: replace

url = taglist[pos].get('href', None)

with:

url = taglist[pos + i].get('href', None)
Answer 2 (score: 0)
The reason you're not getting the right answer is that you never open the links. After finding the correct URL on the first page, you have to open the URL you found with urllib.request.urlopen(url).read() and look for the new link there. You have to repeat this three more times; I'd suggest a while loop.
This code does the trick:
url = 'http://python-data.dr-chuck.net/known_by_Fikret.html'
count = 5
pos = 2
urllist = []
taglist = []
connections = 0
while connections < 5:  # you need to connect five times
    taglist = []
    print('Retrieving: ', url)
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    for i in range(count):
        for tag in tags:
            taglist.append(tag)
    url = taglist[pos].get('href', None)
    urllist.append(url)
    connections = connections + 1
print("last url:", url)
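The crucial point in this answer is that the fetch-and-parse happens inside the loop, so each hop works on a fresh page. That control flow can be checked without touching the network by stubbing the fetch with a dict of made-up pages (a sketch; the names and HTML below are invented for illustration):

```python
from bs4 import BeautifulSoup

# Three tiny fake pages; each links to the "next" person at index 2 (0-based).
PAGES = {
    'Fikret': '<a href="A.html">A</a><a href="B.html">B</a><a href="Montgomery.html">Montgomery</a>',
    'Montgomery': '<a href="C.html">C</a><a href="D.html">D</a><a href="Mhairade.html">Mhairade</a>',
    'Mhairade': '<a href="E.html">E</a><a href="F.html">F</a><a href="Anayah.html">Anayah</a>',
}

def follow(start, pos, count):
    """Follow the link at index pos on each page, count times."""
    name = start
    for _ in range(count):
        soup = BeautifulSoup(PAGES[name], 'html.parser')  # re-parse on every hop
        tags = soup('a')
        href = tags[pos].get('href', None)
        name = href.replace('.html', '')  # key of the next fake page
    return name

print(follow('Fikret', 2, 3))  # Fikret -> Montgomery -> Mhairade -> Anayah
```

If the parse were hoisted out of the loop, as in the question's code, every hop would re-read the first page and return 'Montgomery' forever.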
Answer 3 (score: 0)
import urllib.request
from bs4 import BeautifulSoup

def get_html(url):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    return soup

url = input('Enter - ')
count = int(input('Enter count: '))
pos = int(input('Enter position: ')) - 1
urllist = list()
for i in range(count):
    taglist = list()
    for tag in get_html(url)('a'):  # Needed to update your variable to new url html
        taglist.append(tag)
    url = taglist[pos].get('href', None)  # You grabbed url but never updated your tags variable.
    print('Retrieving: ', url)
    urllist.append(url)
print('Last URL: ', urllist[-1])
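As an aside, the soup('a') call used throughout this thread is shorthand for soup.find_all('a'), and tag.get('href', None) reads an attribute without raising if it's missing. Both behaviors can be seen on an in-memory snippet (a sketch; the HTML is made up):

```python
from bs4 import BeautifulSoup

html = '<p><a href="first.html">First</a> <a>no href</a></p>'
soup = BeautifulSoup(html, 'html.parser')

tags = soup('a')  # calling the soup object is the same as soup.find_all('a')
print(tags[0].get('href', None))  # first.html
print(tags[1].get('href', None))  # None: .get avoids a KeyError for a missing attribute
```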
Answer 4 (score: 0)
Try this:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

def parse(url):
    count = 0
    while count < 7:
        html = urllib.request.urlopen(url, context=ctx).read()
        soup = BeautifulSoup(html, 'html.parser')
        list1 = list()
        tags = soup('a')
        for tag in tags:
            list1.append(tag.get('href', None))
        url = list1[17]
        count += 1
        print('Retrieving:', url)

print(parse('http://py4e-data.dr-chuck.net/known_by_Lorenz.html'))
Here's my output:
Retrieving: http://py4e-data.dr-chuck.net/known_by_Cadyn.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Phebe.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Cullen.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Alessandro.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Gurveer.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Anureet.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Sandie.html
None
Answer 6 (score: 0)
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

urllist = list()
url = input('Enter - ')
count = int(input('Enter count: '))
pos = int(input('Enter position: ')) - 1
for i in range(count):
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    url = tags[pos].get('href', None)
    print('Retrieving: ', url)
    urllist.append(url)
print('Last URL: ', urllist[-1])
Answer 7 (score: 0)
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

url = input("Enter url:")
count = int(input('Enter count:'))
pos = int(input('Enter position:')) - 1
urllist = list()
for i in range(count):
    html = urllib.request.urlopen(url)
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    print('Retrieving:', url)
    taglist = list()
    for tag in tags:
        y = tag.get('href', None)
        taglist.append(y)
    url = taglist[pos]
    urllist.append(url)
x = len(urllist)
print("Last Url:", urllist[x - 1])