Parsing internal links from a website

Time: 2017-09-09 13:25:29

Tags: python parsing hyperlink beautifulsoup bs4

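I need to parse links from an arbitrary website (I set a starting link before parsing begins). The links should be internal, i.e. they must not leave the current site (external links should be ignored). I wrote part of the program, but I get some unwanted links, for example '#', 'tel:+7845225-17-72' and so on. How can I get only the internal links, for example 'mysite.ru/delivery' or '/delivery' (the last variant shows only part of the address)?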

My code:

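from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup, SoupStrainer
import requests

url = 'http://101-rosa.ru'
r = requests.get(url)

soup = BeautifulSoup(r.content, 'html.parser', parse_only=SoupStrainer('a'))
# Raw href values from the start page; these still include '#', 'tel:' etc.
urls = [link['href'] for link in soup.find_all('a', href=True)]

# urls grows while the loop runs, so newly found pages are visited too.
for u in urls:
    # Resolve relative hrefs such as '/delivery' against the start URL.
    nu = urljoin(url, u)
    if urlparse(nu).scheme not in ('http', 'https'):
        continue  # requests cannot fetch 'tel:', 'mailto:' and similar links
    r = requests.get(nu)
    soup2 = BeautifulSoup(r.content, 'html.parser', parse_only=SoupStrainer('a'))
    for link in soup2.find_all('a', href=True):
        found = urljoin(nu, link['href'])
        # netloc is only the host part, e.g. '101-rosa.ru', never a full URL.
        if urlparse(found).netloc == urlparse(url).netloc and found not in urls:
            urls.append(found)


print(len(urls))
print(urls)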

0 answers:

No answers