I've been scraping with BeautifulSoup on Python 2.7 and I ran into this error:
AttributeError: 'function' object has no attribute 'urljoin'
It happens on this line:
first_link = urlparse.urljoin('https://en.wikipedia.org/', article_link)
I imported urljoin with:
from urlparse import urljoin
Answer 0: (score: 5)
You imported two things:
from urlparse import urlparse
from urlparse import urljoin
As a result, the name urlparse is bound to a function, not to the module. Just use urljoin as a global, not as an attribute:
first_link = urljoin('https://en.wikipedia.org/', article_link)
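A minimal sketch of the broken pattern and the fix. It is written for Python 3, where the same functions live in urllib.parse, and the Wikipedia path stands in for the question's article_link variable:

```python
from urllib.parse import urlparse, urljoin

# After this import, the name urlparse refers to the *function*,
# so urlparse.urljoin(...) would raise exactly the AttributeError
# from the question. Call urljoin directly instead:
first_link = urljoin('https://en.wikipedia.org/', '/wiki/Web_scraping')
print(first_link)  # https://en.wikipedia.org/wiki/Web_scraping
```

Alternatively, import only the module (import urlparse on Python 2) and keep using the attribute form urlparse.urljoin(...); the two imports just must not both bind the name urlparse.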
Answer 1: (score: 0)
I'm running urlparse 1.1.1 on Python 2.7.18 and had problems with urljoin. As far as I understand it is no longer supported, but I was able to extract each URL correctly with the approach below. Hope this helps anyone with a similar problem.
Before parsing the extracted links:
/intl/en/ads/
https://google.com/intl/en/ads/
/services/
https://google.com/services/
/intl/en/about.html
After parsing the extracted links:
https://google.com/intl/en/ads/
https://google.com/services/
https://google.com/intl/en/about.html
https://google.com/intl/en/policies/privacy/
https://google.com/intl/en/policies/terms/
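The same before/after transformation can also be done with the standard-library urljoin (urlparse.urljoin on Python 2, urllib.parse.urljoin on Python 3) instead of string concatenation; a sketch, assuming the base https://google.com/ implied by the lists above:

```python
from urllib.parse import urljoin

base = 'https://google.com/'
raw_links = ['/intl/en/ads/', '/services/', '/intl/en/about.html']

# urljoin resolves each relative href against the base URL
absolute = [urljoin(base, link) for link in raw_links]
for url in absolute:
    print(url)
```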
Code to extract and join the links (a terminal script on Linux):
#!/usr/bin/env python
import requests
import re

target_url = raw_input("Enter the target url\n>")

# ANSI escape codes for coloured terminal output
class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'

def request(url):
    try:
        return requests.get("http://" + url)
    except requests.exceptions.ConnectionError:
        pass

def extract_links_from(url):
    response = request(str(url))
    # pull the value of every href="..." attribute out of the page
    return re.findall('(?:href=")(.*?)"', response.content)

def urljoin(href_links):
    for link in href_links:
        if "https://" + target_url not in link and "https://" in link:
            # absolute link to an external site
            print(bcolors.WARNING + link + bcolors.ENDC)
        elif "https://www." + target_url not in link:
            # relative link: prepend the target domain
            print(bcolors.OKGREEN + "https://" + target_url + link + bcolors.ENDC)
        else:
            # absolute link on the target domain
            print(bcolors.OKBLUE + link + bcolors.ENDC)

href_links = extract_links_from(target_url)
urljoin(href_links)
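For comparison, the joining branch of that script can lean on the standard-library function rather than manual concatenation, which also handles absolute hrefs correctly. A minimal Python 3 sketch; the domain example.com and the hrefs are made up for illustration:

```python
from urllib.parse import urljoin

# Hypothetical inputs standing in for the script's raw_input and scraped hrefs
target_url = 'example.com'
href_links = ['/about/', 'https://other.site/page', 'https://example.com/docs/']

base = 'https://' + target_url + '/'
# urljoin resolves relative hrefs against base and leaves absolute URLs alone
joined = [urljoin(base, link) for link in href_links]
for url in joined:
    print(url)
```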