I have to write a program that reads the HTML from this link (http://python-data.dr-chuck.net/known_by_Maira.html), extracts the href= values from the anchor tags, scans for a tag that is at a particular position relative to the first name in the list, follows that link, repeats the process a number of times, and reports the last name found.
I should find the link at position 18 (the first name is 1), follow that link, and repeat the process 7 times. The answer is the last name that I retrieve.
This is the code I found, and it works fine.
import urllib
from BeautifulSoup import *

url = raw_input("Enter URL: ")
count = int(raw_input("Enter count: "))
position = int(raw_input("Enter position: "))

names = []
while count > 0:
    print "retrieving: {0}".format(url)
    page = urllib.urlopen(url)
    soup = BeautifulSoup(page)
    tag = soup('a')
    name = tag[position-1].string
    names.append(name)
    url = tag[position-1]['href']
    count -= 1

print names[-1]
I would really appreciate it if someone could explain to me, as you would to a 10-year-old, what is happening in the while loop. I am new to Python and would be grateful for any guidance.
Many thanks in advance.
Answer 0 (score: 1)
while count > 0:                        # because of `count -= 1` below,
                                        # the loop runs `count` times

    print "retrieving: {0}".format(url) # just prints out the next web page
                                        # you are going to get

    page = urllib.urlopen(url)          # URLs reference web pages (well,
                                        # many types of web content, but
                                        # we'll stick with web pages)

    soup = BeautifulSoup(page)          # web pages are frequently written
                                        # in HTML, which can be messy. This
                                        # package "unmessifies" it

    tag = soup('a')                     # in HTML you can highlight text and
                                        # reference other web pages with <a>
                                        # tags. This gets all of the <a> tags
                                        # in a list

    name = tag[position-1].string       # this gets the <a> tag at position-1
                                        # and then gets its text value

    names.append(name)                  # this puts that value in your own
                                        # list

    url = tag[position-1]['href']       # HTML tags can have attributes. On
                                        # an <a> tag, the href="something"
                                        # attribute references another web
                                        # page. You store it in `url` so that
                                        # it's the page you grab on the next
                                        # iteration of the loop

    count -= 1
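If it helps to see those pieces in isolation, here is a minimal sketch of what soup('a'), .string, and ['href'] each give you. It uses the Python 3 bs4 package rather than the old BeautifulSoup module, and a made-up HTML fragment instead of one of the course pages:

from bs4 import BeautifulSoup   # pip install beautifulsoup4

# Tiny, made-up HTML fragment standing in for a page with one link.
html = '<p><a href="http://example.com/page2.html">Maira</a></p>'

soup = BeautifulSoup(html, 'html.parser')
tags = soup('a')                # a list of every <a> tag found

first = tags[0]
print(first.string)             # text between the tags  -> Maira
print(first['href'])            # href attribute         -> http://example.com/page2.html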
Answer 1 (score: 1)
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors (ctx was used but never defined in the
# original snippet)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

total = 0
url = input('Enter - ')
c = input('enter count-')
count = int(c)
p = input('enter position-')
pos = int(p)

while total <= count:
    html = urllib.request.urlopen(url, context=ctx).read()
    print("Retrieving", url)
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')                    # all <a> tags on the page
    counter = 0
    for tag in tags:
        counter = counter + 1
        if counter <= pos:              # keep overwriting url until we
            x = tag.get('href', None)   # reach the tag at position pos
            url = x
        else:
            break
    total = total + 1
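As written, this version only follows the chain of links; it never stores the names, so it cannot print the last name the assignment asks for. A minimal sketch of how the loop could be reworked to do that (the names list, the direct tags[pos - 1] indexing, and the final print are my additions, not part of this answer):

names = []                              # hypothetical: collect the name at each hop

while total <= count:
    html = urllib.request.urlopen(url, context=ctx).read()
    print("Retrieving", url)
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    chosen = tags[pos - 1]              # the <a> tag at the wanted position
    names.append(chosen.string)         # its visible text (the first name)
    url = chosen.get('href')            # its link, fetched on the next pass
    total = total + 1

print(names[-1])                        # the last name retrieved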
Answer 2 (score: 0)
You can enter how many URLs you want to retrieve from the page:

0) print the URL
1) open the URL
2) read the source (see the BeautifulSoup docs)
3) get every a tag
4) get the whole <a ...></a>, I think
5) add it to the list names
6) get the URL from the last item of names, i.e. extract the href from the <a ...></a>
7) print the last item of the list names
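Since the question's code is Python 2 with the old BeautifulSoup module, here is a rough Python 3 / bs4 port of it with the steps above marked as comments (this port is my own sketch, not part of the answer):

import urllib.request
from bs4 import BeautifulSoup

url = input("Enter URL: ")
count = int(input("Enter count: "))
position = int(input("Enter position: "))

names = []
while count > 0:
    print("retrieving: {0}".format(url))        # 0) print the URL
    page = urllib.request.urlopen(url)          # 1) open the URL
    soup = BeautifulSoup(page, 'html.parser')   # 2) read/parse the source
    tag = soup('a')                             # 3) get every <a> tag
    name = tag[position - 1].string             # 4) get that tag's text
    names.append(name)                          # 5) add it to the list names
    url = tag[position - 1]['href']             # 6) extract the href for the next page
    count -= 1

print(names[-1])                                # 7) print the last item of names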
Answer 3 (score: -1)
[Answer: I should find the link at position 18 (the first name is 1), follow that link, and repeat this process 7 times. The answer is]