Question

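I'm trying to build a web scraper with the requests module. Basically, I want it to go to a web page, grab all the hrefs, and write them to a text file.

So far my code looks like this: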

import requests
from bs4 import BeautifulSoup

def getLinks(url):
    response = requests.get(url).text
    soup = BeautifulSoup(response, "html.parser")
    # print the href of every <a> tag on the page
    for link in soup.findAll("a"):
        print("Link:" + str(link.get("href")))

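It works on some websites, but on the one I'm trying to use it on, the hrefs aren't full URLs like "www.google.com". Instead they are something like... paths that redirect to the actual link?

They look like this: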

href="/out/101"

If I try to write them to a file, it looks like this:

 1. /out/101
 2. /out/102
 3. /out/103
 4. /out/104

Which is not what I want.

So how do I get the domain name for these links?

Answer 1

This means the URLs are relative to the current page's URL. To get the full URLs, use urljoin():

from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

for link in soup.findAll("a", href=True):  # skip <a> tags without an href
    full_url = urljoin(url, link["href"])
    print("Link:" + full_url)
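
To see what urljoin() does with hrefs like the ones in the question, here is a minimal sketch that needs no network access. The page URL, the sample hrefs, and the helper names resolve_links and write_links are made up for illustration; they are not part of the original question or answer.

```python
from urllib.parse import urljoin

def resolve_links(page_url, hrefs):
    """Resolve each href against page_url; skip missing or empty hrefs."""
    return [urljoin(page_url, href) for href in hrefs if href]

def write_links(path, links):
    """Write one link per line to a text file."""
    with open(path, "w", encoding="utf-8") as f:
        for link in links:
            f.write(link + "\n")

# Hypothetical page URL and hrefs, for illustration only.
page_url = "http://example.com/list/page.html"
hrefs = ["/out/101", "/out/102", "details.html", "http://www.google.com/", None]

full_links = resolve_links(page_url, hrefs)

if __name__ == "__main__":
    for link in full_links:
        print("Link:" + link)
    write_links("links.txt", full_links)
```

An href starting with "/" is resolved against the site root, a bare name like "details.html" is resolved against the directory of the current page, and an href that is already absolute is returned unchanged.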

Answer 2

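Try the following code. It will give you all the links on the given page. If you know the base URL of the website, you can build all the other URLs from it. The complete web scraping code is here: WebScrape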

import requests
import lxml.html

def extractLinks(url, base):
    '''
    Return links from the website
    :param url: the URL of the page to fetch
    :param base: the base URL of the site, used to complete relative links
    :return: list of links
    '''
    links = []  # will contain all the links from the page
    try:
        r = requests.get(url)
    except requests.RequestException:
        # the request failed, so there are no links to return
        return []
    obj = lxml.html.fromstring(r.text)
    potential_links = obj.xpath("//a/@href")
    links.append(r.url)  # the URL of the page itself, after any redirects
    for link in potential_links:
        if base in link or link.startswith("http"):
            # already an absolute URL
            links.append(link)
        else:
            # relative link: join it to the base with exactly one slash
            links.append(base.rstrip("/") + "/" + link.lstrip("/"))

    return links

extractLinks('http://data-interview.enigmalabs.org/companies/',
    'http://data-interview.enigmalabs.org/')
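
If the base URL is not known in advance, it can usually be derived from the page URL itself. The following is a small sketch using the standard library's urlparse; the helper name get_base is hypothetical and is not part of the answer's code.

```python
from urllib.parse import urlparse

def get_base(url):
    """Return the scheme and host of url with a trailing slash."""
    parts = urlparse(url)
    return "{}://{}/".format(parts.scheme, parts.netloc)

page_url = "http://data-interview.enigmalabs.org/companies/"
base = get_base(page_url)            # scheme + host, usable as the base argument
domain = urlparse(page_url).netloc   # just the domain name

if __name__ == "__main__":
    print(base)
    print(domain)
```

The value of base could then be passed as the second argument to extractLinks, so that only the page URL has to be supplied by hand.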

Python - requests module, get the domain name?
