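For educational purposes, I am trying to write a program that prompts the user for a "url", a "count" and a "position". The "url" is scraped and the anchor tags inside it are retrieved, which produces a list of tags. The "position" is then used to pick a new link from that list, and that link becomes the new "url" to scrape. The "count" is the number of times this process is repeated.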
Code:
import urllib
from bs4 import BeautifulSoup as bfs
# Declare global variables
href_list = []
no_iterations = 0
# Prompt user for input
url = raw_input('Enter url - ')
count = raw_input('Enter count - ')
position = raw_input('Enter position - ')
# While loop with condition
while no_iterations != int(count):
    no_iterations += 1
    # Scraping the url
    html = urllib.urlopen(url).read()
    soup = bfs(html)
    # Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        href_list.append(tag.get('href', None))
    # Assigning new url
    url = href_list[int(position)-1]
    # Printing info for user
    print 'Retrieving:', href_list[int(position)-1]
print 'Last Url:', href_list[int(position)-1]
When I run the program, this is what I get:
Enter url - http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html
Enter count - 4
Enter position - 3
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html
Last Url: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html
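Looking at the output, I can see that the URL is not being reset as it should be. Any suggestions are appreciated.
Answer 0: (score: 1)
I solved it by resetting the list in which I store the retrieved tags. Because href_list was never cleared, the links from each new page were appended after those of the first page, so href_list[int(position)-1] kept pointing at a link from the first page. Code: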
import urllib
from bs4 import BeautifulSoup as bfs
# Declare global variables
href_list = []
no_iterations = 0
# Prompt user for input
url = raw_input('Enter url - ')
count = raw_input('Enter count - ')
position = raw_input('Enter position - ')
# While loop with condition
while no_iterations != int(count):
    no_iterations += 1
    # Scraping the url
    html = urllib.urlopen(url).read()
    soup = bfs(html)
    # Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        href_list.append(tag.get('href', None))
    # Assigning new url
    url = href_list[int(position)-1]
    # Reset the list so the next page's links start at index 0
    href_list = []
    # Printing info for user (use url: href_list is empty at this point)
    print 'Retrieving:', url
print 'Last Url:', url
So the new output is now:
Enter url - http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html
Enter count - 4
Enter position - 3
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Mhairade.html
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Butchi.html
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Anayah.html
Last Url: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Anayah.html
Thanks for your support.
Answer 1: (score: 0)
I modified the output code, and this version works (Python 3):
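To see why resetting the list matters, here is a minimal sketch that needs no network access. `FAKE_SITE`, `follow` and the one-letter page names are hypothetical stand-ins invented for this illustration; they are not part of the original program. The sketch is written for Python 3.

```python
# Hypothetical stand-in for the scraped site: page name -> hrefs found on it.
FAKE_SITE = {
    "A": ["B", "C", "D"],
    "D": ["E", "F", "G"],
    "G": ["H", "I", "J"],
}


def follow(start, count, position, reset_each_time):
    """Follow the link at `position` (1-based) `count` times."""
    href_list = []
    url = start
    visited = []
    for _ in range(count):
        if reset_each_time:
            href_list = []  # start from an empty list for every page
        href_list.extend(FAKE_SITE.get(url, []))
        url = href_list[position - 1]
        visited.append(url)
    return visited


# Without the reset, index 2 always refers to a link from the first page.
print(follow("A", 3, 3, reset_each_time=False))  # ['D', 'D', 'D']
# With the reset, index 2 refers to a link on the page just fetched.
print(follow("A", 3, 3, reset_each_time=True))   # ['D', 'G', 'J']
```

Without the reset, the list only grows, so the item at a fixed position never changes after the first page.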
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup as bfs
# Global variables
href_list = []
no_iterations = 0
# Prompt user for input
url = input('Enter url:')
count = input('Enter no. of iterations: ')
position = input('Enter start position ')
# While loop with condition
while no_iterations != int(count):
    no_iterations += 1
    # Scraping the url
    soup = bfs(urllib.request.urlopen(url).read(), 'html.parser')
    # Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        href_list.append(tag.get('href'))
    # Assigning new url
    url = href_list[int(position)-1]
    # Printing info for user
    print('Retrieving:', url)
    # Reset the list before the next page is scraped
    href_list = []
print('Last Url:', url)
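One way to avoid the stale-list problem altogether is to build the list of links fresh inside the loop instead of keeping a global one. The following is only a sketch under a few assumptions: it swaps BeautifulSoup for the standard library's `html.parser.HTMLParser` so that it runs without extra installs, and `AnchorCollector`, `follow_links` and the `fetch` parameter are hypothetical names introduced here, not part of the answers above.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Like tag.get('href') in the answers, this stores None
            # for an anchor that has no href attribute.
            self.hrefs.append(dict(attrs).get("href"))


def follow_links(url, count, position, fetch):
    """Follow the link at `position` (1-based) `count` times.

    `fetch` is a hypothetical callable that takes a URL and returns
    the page's HTML as a string.
    """
    for _ in range(count):
        collector = AnchorCollector()  # new parser, so a new, empty list
        collector.feed(fetch(url))
        url = collector.hrefs[position - 1]
        print("Retrieving:", url)
    return url
```

For real pages, `fetch` could be something like `lambda u: urllib.request.urlopen(u).read().decode()`, assuming the pages decode as UTF-8. Passing the fetcher in also makes the loop easy to try out against a dictionary of HTML strings.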