I want to write some code that works as follows: you give it a URL, it looks at how many links there are on that page, you pick one, it looks at the new page, you pick a link, and so on.
I have some code that opens a web page, finds the links and builds a list from them:
import urllib
from bs4 import BeautifulSoup
list_links = []
page = raw_input('enter an url')
url = urllib.urlopen(page).read()
html = BeautifulSoup(url, 'html.parser')
for link in html.find_all('a'):
    link = link.get('href')
    list_links.append(link)
Next, I want the user to decide which link to follow, so I have this:
link_number = len(list_links)
print 'enter a number between 0 and', (link_number)
number = raw_input('')
for number in number:
    if int(number) < 0 or int(number) > link_number:
        print "The End."
        break
    else:
        continue
url_2 = urllib.urlopen(list_links[int(number)]).read()
My code crashes.
Ideally, I would like this to run as an endless process (which the user can stop by entering an out-of-range number): open a page -> count the links -> pick one -> follow that link and open the new page -> count the links -> ...
Can anyone help me?
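To make the intended flow clearer, here is roughly the structure I imagine, as a rough sketch only (same Python 2 / urllib / BeautifulSoup setup as above; the number-checking step, which is where I am stuck, is only marked with a comment):
import urllib
from bs4 import BeautifulSoup

page = raw_input('enter an url: ')
while True:
    # open the page and collect every href into a list
    html = BeautifulSoup(urllib.urlopen(page).read(), 'html.parser')
    list_links = [a.get('href') for a in html.find_all('a') if a.get('href')]
    print 'enter a number between 0 and', len(list_links) - 1
    number = raw_input('')
    # here I would need to check the number and stop if it is out of range;
    # this validation step is what keeps crashing in my code above
    page = list_links[int(number)]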
Answer 0 (score: 0)
You could try something like this (sorry if it is not very pretty, I wrote it in a bit of a hurry):
import requests, random
from bs4 import BeautifulSoup as BS
from time import sleep
def main(url):
    content = scraping_call(url)
    if not content:
        print "Couldn't get html..."
        return
    else:
        links_list = []
        soup = BS(content, 'html5lib')
        for link in soup.findAll('a'):
            try:
                links_list.append(link['href'])
            except KeyError:
                continue

        chosen_link_index = int(raw_input("Enter a number between 0 and %d: " % (len(links_list) - 1)))
        if not 0 <= chosen_link_index < len(links_list):
            raise ValueError('Number must be between 0 and %d' % (len(links_list) - 1))
            # script will crash here.
            # If you want the user to try again, you can
            # set up a nr of attempts, like in scraping_call()
        else:
            # if the user wants to stop the infinite loop
            next_step = raw_input('Continue or exit? (Y/N) ') or 'Y'
            # default value is 'yes', so to continue
            # just press Enter
            if next_step.lower() == 'y':
                main(links_list[chosen_link_index])
            else:
                return
def scraping_call(url):
    attempt = 1
    while attempt < 6:
        try:
            page = requests.get(url)
            if page.status_code == 200:
                result = page.content
            else:
                result = ''
        except Exception as e:
            result = ''
            print 'Failed attempt (', attempt, '):', e
            attempt += 1
            sleep(random.randint(2, 4))
            continue
        return result
if __name__ == '__main__':
    main('enter the starting URL here')
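One caveat about the snippet above: every page visit calls main() again, so a very long browsing session would eventually hit Python's recursion limit. If that is a concern, the same flow can be written with a plain loop; below is a rough sketch that reuses the BS import and the scraping_call() helper defined above (the browse() name is just for illustration):
def browse(start_url):
    # assumes the requests/BS imports and scraping_call() from the snippet above
    url = start_url
    while True:
        content = scraping_call(url)
        if not content:
            print "Couldn't get html..."
            return
        soup = BS(content, 'html5lib')
        links_list = [a['href'] for a in soup.findAll('a') if a.has_attr('href')]
        if not links_list:
            print "No links on this page."
            return
        idx = int(raw_input("Enter a number between 0 and %d: " % (len(links_list) - 1)))
        if not 0 <= idx < len(links_list):
            print "The End."
            return
        url = links_list[idx]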
Answer 1 (score: 0)
Some of the links on a page may be given as relative addresses, and we need to take that into account. This should do the trick; it works on Python 3.4.
from urllib.request import urlopen
from urllib.parse import urljoin, urlsplit
from bs4 import BeautifulSoup
addr = input('enter an initial url: ')
while True:
    html = BeautifulSoup(urlopen(addr).read(), 'html.parser')
    list_links = []
    num = 0
    for link in html.find_all('a'):
        url = link.get('href')
        if not url:  # skip <a> tags that have no href attribute
            continue
        if not urlsplit(url).netloc:
            url = urljoin(addr, url)
        if urlsplit(url).scheme in ['http', 'https']:
            print("%d : %s " % (num, str(url)))
            list_links.append(url)
            num += 1
    idx = int(input("enter an index between 0 and %d: " % (len(list_links) - 1)))
    if not 0 <= idx < len(list_links):
        raise ValueError('Number must be between 0 and %d' % (len(list_links) - 1))
    addr = list_links[idx]
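For a quick feel of what the urljoin()/urlsplit() step does, here is a short Python 3 check (the example URLs are made up):
from urllib.parse import urljoin, urlsplit

base = 'http://example.com/articles/page.html'   # hypothetical page being scraped
print(urljoin(base, 'other.html'))         # http://example.com/articles/other.html
print(urljoin(base, '/images/pic.png'))    # http://example.com/images/pic.png
print(urljoin(base, 'http://other.org/'))  # absolute links pass through unchanged
print(urlsplit('other.html').netloc)       # '' (an empty netloc marks a relative link)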