Question

I wrote a program to extract web links from http://www.stevens.edu/. I am now facing the following issues with it.

1 - I only want to get links that start with http or https.

2 - I get a parser warning from bs4 because no parser is specified - solved

How can I fix this? I have not been able to find the right direction for solving it.

My code is -

import urllib2

from bs4 import BeautifulSoup as bs
url = raw_input('Please enter the url for which you want to see unique web links -')

print "\n"

# URLs (mostly HTTP) in a complex world
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})  
html = urllib2.urlopen(req).read()
soup = bs(html)
tags = soup('a')
count = 0
web_link = []
for tag in tags:
    count = count + 1
    store = tag.get('href', None)
    web_link.append(store)
print "Total no. of extracted web links are",count,"\n"
print web_link
print "\n"
Unique_list = set(web_link)
Unique_list = list(Unique_list)

print "No. of the Unique web links after using set method", len(Unique_list),"\n"

Answer 1

For the second issue, you need to specify the parser when creating the bs object for the page.
soup = bs(html,"html.parser")

This should get rid of your warning.
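
For the first issue, one option is to filter the collected href values by their URL scheme. Below is a minimal sketch, not run against the site in the question. The helper name filter_http_links and the sample list are made up for illustration, and the import fallback is only there so the snippet runs on both Python 2 (which the question uses) and Python 3.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2, as used in the question


def filter_http_links(hrefs):
    """Keep only absolute links whose scheme is http or https.

    Hypothetical helper for illustration, not part of bs4 or urllib.
    """
    kept = []
    for href in hrefs:
        # tag.get('href') gives None for <a> tags that have no href
        if not href:
            continue
        href = href.strip()
        # urlparse lowercases the scheme, so 'HTTP://...' is kept as well
        if urlparse(href).scheme in ('http', 'https'):
            kept.append(href)
    return kept


# Made-up sample standing in for the web_link list from the question
sample = ['http://www.stevens.edu/', 'https://www.stevens.edu/about',
          '/admissions', 'mailto:info@example.com', None, '#top']
http_links = filter_http_links(sample)
```

In the question's code this could be applied as web_link = filter_http_links(web_link) before the list is passed to set(). Relative links such as /admissions are dropped by this approach; if you want to keep them, they would first have to be resolved against the page URL.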
