I've been scraping with BeautifulSoup on Python 2.7 and I ran into this error:
AttributeError: 'function' object has no attribute 'urljoin'
It happens on this line:
first_link = urlparse.urljoin('https://en.wikipedia.org/', article_link)
I imported urljoin with:
from urlparse import urljoin
Answer 0: (score: 5)
You imported two things:
from urlparse import urlparse
from urlparse import urljoin
As a result, the name urlparse is bound to a function, not to the module. Just use urljoin as a global, not as an attribute:
first_link = urljoin('https://en.wikipedia.org/', article_link)
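A minimal sketch of the broken pattern and the fix. It is written for Python 3, where the same functions live in urllib.parse, and the Wikipedia path stands in for the question's article_link variable:

```python
from urllib.parse import urlparse, urljoin

# After this import, the name urlparse refers to the *function*,
# so urlparse.urljoin(...) would raise exactly the AttributeError
# from the question. Call urljoin directly instead:
first_link = urljoin('https://en.wikipedia.org/', '/wiki/Web_scraping')
print(first_link)  # https://en.wikipedia.org/wiki/Web_scraping
```

Alternatively, import only the module (import urlparse on Python 2) and keep using the attribute form urlparse.urljoin(...); the two imports just must not both bind the name urlparse.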
Answer 1: (score: 0)
I'm running urlparse 1.1.1 on Python 2.7.18 and had problems with urljoin. As far as I understand it is no longer supported, but I was able to extract each URL correctly with the approach below. Hope this helps anyone with a similar problem.
Before parsing the extracted links:
/intl/en/ads/
https://google.com/intl/en/ads/
/services/
https://google.com/services/
/intl/en/about.html
After parsing the extracted links:
https://google.com/intl/en/ads/
https://google.com/services/
https://google.com/intl/en/about.html
https://google.com/intl/en/policies/privacy/
https://google.com/intl/en/policies/terms/
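The same before/after transformation can also be done with the standard-library urljoin (urlparse.urljoin on Python 2, urllib.parse.urljoin on Python 3) instead of string concatenation; a sketch, assuming the base https://google.com/ implied by the lists above:

```python
from urllib.parse import urljoin

base = 'https://google.com/'
raw_links = ['/intl/en/ads/', '/services/', '/intl/en/about.html']

# urljoin resolves each relative href against the base URL
absolute = [urljoin(base, link) for link in raw_links]
for url in absolute:
    print(url)
```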
Code to extract and join the links (a terminal script on Linux):
#!/usr/bin/env python
import requests
import re

target_url = raw_input("Enter the target url\n>")

# ANSI escape codes for coloured terminal output
class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'

def request(url):
    try:
        return requests.get("http://" + url)
    except requests.exceptions.ConnectionError:
        pass

def extract_links_from(url):
    response = request(str(url))
    # pull the value of every href="..." attribute out of the page
    return re.findall('(?:href=")(.*?)"', response.content)

def urljoin(href_links):
    for link in href_links:
        if "https://" + target_url not in link and "https://" in link:
            # absolute link to an external site
            print(bcolors.WARNING + link + bcolors.ENDC)
        elif "https://www." + target_url not in link:
            # relative link: prepend the target domain
            print(bcolors.OKGREEN + "https://" + target_url + link + bcolors.ENDC)
        else:
            # absolute link on the target domain
            print(bcolors.OKBLUE + link + bcolors.ENDC)

href_links = extract_links_from(target_url)
urljoin(href_links)
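For comparison, the joining branch of that script can lean on the standard-library function rather than manual concatenation, which also handles absolute hrefs correctly. A minimal Python 3 sketch; the domain example.com and the hrefs are made up for illustration:

```python
from urllib.parse import urljoin

# Hypothetical inputs standing in for the script's raw_input and scraped hrefs
target_url = 'example.com'
href_links = ['/about/', 'https://other.site/page', 'https://example.com/docs/']

base = 'https://' + target_url + '/'
# urljoin resolves relative hrefs against base and leaves absolute URLs alone
joined = [urljoin(base, link) for link in href_links]
for url in joined:
    print(url)
```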