Unable to append a base URL to create absolute links with BeautifulSoup in Python 3

Asked: 2017-03-06 01:15:02

Tags: python-3.x beautifulsoup python-requests

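I get a list of links in my output file, but I need all of them to be absolute links. Some are absolute and some are relative. How can I prepend the base URL to the relative ones so that the CSV output contains only absolute links?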

I get all the links back, but not all of them are absolute, e.g. /subpage instead of http://page.com/subpage

    from bs4 import BeautifulSoup
    import requests 
    import csv

    j = requests.get("http://cnn.com").content
    soup = BeautifulSoup(j, "lxml") 

    # only return links to subpages, e.g. <a> tags that contain an href
    data = []
    for url in soup.find_all('a', href=True):
        print(url['href'])
        data.append(url['href'])

    print(data)

    with open("file.csv", 'w') as csvfile:
        write = csv.writer(csvfile, delimiter=' ')
        write.writerows(data)

    content = open('file.csv', 'r').readlines()
    content_set = set(content)
    cleandata = open('file.csv', 'w')

    for line in content_set:
        cleandata.write(line)

1 Answer:

Answer 0 (score: 1)

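Use urljoin to resolve each href against the base URL. Links that are already absolute come back unchanged.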

    # Python 3: urljoin lives in urllib.parse (urlparse was the Python 2 module)
    from urllib.parse import urljoin
    ...
    base_url = "http://cnn.com"
    absolute_url = urljoin(base_url, relative_url)
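
To show how this could fit into the question's loop, here is a minimal sketch using only the standard library. The helper names to_absolute and write_links and the sample hrefs are hypothetical, standing in for the url['href'] values collected from soup.find_all. It also removes duplicates in memory, which may make the second pass over file.csv unnecessary.

```python
import csv
from urllib.parse import urljoin


def to_absolute(base_url, hrefs):
    """Resolve each href against base_url, dropping duplicates but keeping order."""
    seen = set()
    result = []
    for href in hrefs:
        # Relative hrefs are joined to the base; absolute hrefs pass through unchanged.
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


def write_links(path, links):
    """Write one link per row. Each row is a one-element list, not a bare string."""
    with open(path, "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerows([link] for link in links)


# Hypothetical sample values standing in for the scraped href attributes.
sample_hrefs = ["/subpage", "http://page.com/subpage", "http://other.com/x", "about.html"]
links = to_absolute("http://page.com", sample_hrefs)
```

Calling write_links("file.csv", links) would then write the absolute links, one per line. Wrapping each link in a list matters because csv.writer treats a bare string as a sequence of single-character fields.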