How do I loop over tags and redirect in order to retrieve more tags?

Asked: 2015-10-29 19:29:40

Tags: python html web-scraping

For educational purposes, I am trying to write a program that prompts the user for a "url", a "count", and a "position". The "url" is scraped and the anchor "tags" inside it are retrieved, producing a list of "tags". The "position" is then used to select a new link from that list, which becomes the new "url" to scrape. "count" is the number of times this process repeats.

Code:
import urllib
from bs4 import BeautifulSoup as bfs

# Declare global variables
href_list = []
no_iterations = 0

# Prompt user for input
url = raw_input('Enter url - ')
count = raw_input('Enter count - ')
position = raw_input('Enter position - ')

# While loop with condition
while no_iterations != int(count):
    no_iterations += 1

    # Scraping the url 
    html = urllib.urlopen(url).read()
    soup = bfs(html)

    # Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        href_list.append(tag.get('href', None))

    # Assigning new url
    url = href_list[int(position)-1]

    # Printing info for user
    print 'Retrieving:', href_list[int(position)-1]
print 'Last Url:', href_list[int(position)-1]

When I run the program, this is what I get:

Enter url - http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html 
Enter count - 4
Enter position - 3 

Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html
Last Url: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html

Looking at the output, I can see that the URL is not being reset as intended. Any suggestions are appreciated.
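For context, the repetition happens because `href_list` keeps growing across iterations, so index `position - 1` always lands on a link from the first page. A tiny self-contained illustration of that indexing effect (the link lists here are made up, not taken from the actual pages):

```python
# Simulate two loop iterations without ever resetting the list.
href_list = []
position = 3

page1_links = ['/a', '/b', '/first-pick', '/d']
href_list.extend(page1_links)
print(href_list[position - 1])  # -> /first-pick

page2_links = ['/e', '/f', '/second-pick']
href_list.extend(page2_links)   # old links still occupy indices 0-3
print(href_list[position - 1])  # still -> /first-pick
```

Resetting (or rebuilding) the list each iteration makes index `position - 1` refer to the current page's links, which is exactly the fix in the accepted answer below.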

2 answers:

Answer 0 (score: 1)

I solved it by resetting the list in which I store the retrieved tags. Code:

import urllib
from bs4 import BeautifulSoup as bfs

# Declare global variables
href_list = []
no_iterations = 0

# Prompt user for input
url = raw_input('Enter url - ')
count = raw_input('Enter count - ')
position = raw_input('Enter position - ')

# While loop with condition
while no_iterations != int(count):
    no_iterations += 1

    # Scraping the url 
    html = urllib.urlopen(url).read()
    soup = bfs(html)

    # Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        href_list.append(tag.get('href', None))

    # Assigning new url
    url = href_list[int(position)-1]
    href_list = []
    # Printing info for user
    print 'Retrieving:', url
print 'Last Url:', url

So the new output now is:

Enter url - http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html 
Enter count - 4
Enter position - 3
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Montgomery.html
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Mhairade.html
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Butchi.html
Retrieving: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Anayah.html
Last Url: http://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Anayah.html

Thanks for your support.
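An alternative that avoids both the manual reset and the reset-before-print ordering pitfall is to build a fresh list on every fetch inside a helper. A minimal, dependency-free sketch of that idea, using only the standard library's `html.parser` instead of BeautifulSoup (the `LinkCollector`/`nth_link` names and the sample HTML are illustrative, not from the original answers):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from anchor tags, in document order."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.hrefs.append(dict(attrs).get('href'))

def nth_link(html, position):
    """Return the href at 1-based `position`; a fresh collector per call."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.hrefs[position - 1]

sample = '<a href="/one">1</a><a href="/two">2</a><a href="/three">3</a>'
print(nth_link(sample, 3))  # -> /three
```

Because `nth_link` creates a new collector each call, there is no shared list to reset between iterations.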

Answer 1 (score: 0)

I modified the code (for Python 3) and it works:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup as bfs
#global variables
href_list = []
no_iterations = 0
# Prompt user for input
url = input('Enter url:')
count = input('Enter no. of iterations: ')
position = input('Enter start position ')
# While loop with condition
while no_iterations != int(count):
    no_iterations += 1
    # Scraping the url 
    soup = bfs((urllib.request.urlopen(url).read()),'html.parser')
    # Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        href_list.append(tag.get('href'))
    # Assigning new url
    url = href_list[int(position)-1]
    # Printing info for user
    print ('Retrieving:', url)
    href_list = []
print ('Last Url:', url)
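As a side note, the accumulating list can be dropped entirely by indexing the current page's links directly each iteration. A sketch of that shape with a stand-in page table instead of real network fetches (the `PAGES` contents and the regex-based `links` helper are made up for illustration; a real program would fetch with `urllib` and parse with BeautifulSoup):

```python
import re

# Fake fetch results standing in for urllib.request.urlopen(url).read().
PAGES = {
    '/start': '<a href="/a">a</a><a href="/b">b</a><a href="/c">c</a>',
    '/c':     '<a href="/x">x</a><a href="/y">y</a><a href="/end">end</a>',
}

def links(html):
    # Crude href extraction, good enough for this sketch only.
    return re.findall(r'href="([^"]+)"', html)

url, count, position = '/start', 2, 3
for _ in range(count):
    url = links(PAGES[url])[position - 1]  # no accumulating list to reset
    print('Retrieving:', url)
print('Last Url:', url)
```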