How to handle an InvalidSchema exception

Date: 2018-11-22 15:44:18

Tags: python python-3.x function web-scraping return

I've written a script in Python using two functions. The first function, get_links(), scrapes some links from a webpage and returns them to the second function, get_info(). At that point get_info() is supposed to produce the different shop names from the different links, but instead it throws an error: raise InvalidSchema("No connection adapters were found for '%s'" % url)

This is my attempt:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def get_links(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text,"lxml")
    elem = soup.select(".info h2 a[data-analytics]")
    return get_info(elem)

def get_info(url):
    response = requests.get(url)
    print(response.url)
    soup = BeautifulSoup(response.text,"lxml")
    return soup.select_one("#main-header .sales-info h1").get_text(strip=True)

if __name__ == '__main__':
    link = 'https://www.yellowpages.com/search?search_terms=%20Injury%20Law%20Attorneys&geo_location_terms=California&page=2'    
    for review in get_links(link):
        print(urljoin(link,review.get("href")))

The key thing I want to learn here is the actual usage of return get_info(elem).

I created another thread related to this return get_info(elem). Link to that thread

When I try the following, I get the results as expected:

def get_links(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text,"lxml")
    elem = soup.select(".info h2 a[data-analytics]")
    return elem

def get_info(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text,"lxml")
    return soup.select_one("#main-header .sales-info h1").get_text(strip=True)

if __name__ == '__main__':
    link = 'https://www.yellowpages.com/search?search_terms=%20Injury%20Law%20Attorneys&geo_location_terms=California&page=2'    
    for review in get_links(link):
        print(get_info(urljoin(link,review.get("href"))))
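As a side note on the working version: urljoin from the standard library is what turns each relative href into the absolute URL that get_info needs. A minimal, network-free illustration (the href path below is a made-up example, not a real listing):

```python
from urllib.parse import urljoin

base = 'https://www.yellowpages.com/search?search_terms=%20Injury%20Law%20Attorneys&geo_location_terms=California&page=2'

# A root-relative href replaces the path and query of the base URL,
# keeping the scheme and host
print(urljoin(base, '/los-angeles-ca/mip/example-firm-12345'))
# → https://www.yellowpages.com/los-angeles-ca/mip/example-firm-12345
```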

My question: how can I get the results the way my first script tries to, using return get_info(elem)?

1 Answer:

Answer 0 (score: 2):

Check what each function returns. In this case, the function call in your first script can never work. The reason is that get_info() accepts a URL and nothing else. So naturally you hit an error when you run get_info(elem), where elem is the list of items selected by soup.select().
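The failure can be reproduced without touching the network. The fetch() below is a hypothetical, simplified stand-in for requests.get (not the real library code): it only accepts a single http(s) string URL, so handing it the whole list, as get_info(elem) does, fails immediately.

```python
def fetch(url):
    # Hypothetical stand-in for requests.get: it insists on one http(s)
    # string URL, loosely mirroring the check behind the real
    # "No connection adapters were found" error.
    if not isinstance(url, str) or not url.startswith(("http://", "https://")):
        raise ValueError("No connection adapters were found for %r" % (url,))
    return "fetched " + url

# What soup.select()-style code yields: a list of items, not one URL
links = ["/ca/mip/firm-1", "/ca/mip/firm-2"]

try:
    fetch(links)  # passing the whole list, like get_info(elem) in the first script
except ValueError as err:
    print(err)

# Passing one absolute URL at a time works
print(fetch("https://www.yellowpages.com" + links[0]))
```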

You should already understand the above, though, since in your second script you loop over the returned list to get the href of each element. So if you want to use get_info in the first script, apply it to the individual items rather than the list; in this case a list comprehension will do.

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def get_links(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text,"lxml")
    elem = soup.select(".info h2 a[data-analytics]")
    return [get_info(urljoin(url, e.get("href"))) for e in elem]

def get_info(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text,"lxml")
    return soup.select_one("#main-header .sales-info h1").get_text(strip=True)

link = 'https://www.yellowpages.com/search?search_terms=%20Injury%20Law%20Attorneys&geo_location_terms=California&page=2'

for review in get_links(link): 
    print(review) 

Now the first function still returns a list, but get_info has been applied to each of its elements, which is the right way to do it: get_info accepts a URL, not a list. From there, since urljoin and get_info were already applied inside get_links, you can simply loop over the result and print it.
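Stripped of the scraping details, the pattern the answer uses is simply: the first function builds the full input for the second and maps it over every element, so the caller only ever sees finished results. A minimal sketch with plain, network-free stand-ins (these get_links/get_info bodies are hypothetical, not the real scraper):

```python
def get_info(url):
    # Stand-in for the real get_info: pretend the shop name is the
    # last path segment of the URL
    return url.rsplit("/", 1)[-1].replace("-", " ").title()

def get_links(base, hrefs):
    # Apply get_info to each individual absolute URL, never to the list itself
    return [get_info(base + href) for href in hrefs]

for name in get_links("https://www.yellowpages.com",
                      ["/ca/mip/smith-law", "/ca/mip/jones-injury-firm"]):
    print(name)
# Smith Law
# Jones Injury Firm
```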