InvalidSchema: No connection adapters were found (Python 3.5.2)

Time: 2017-02-17 10:09:41

Tags: python-3.x httprequest bytestring

I am trying to extract email addresses from web pages. This is my email-grabbing function:

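# Imports assumed from the names used in the snippets
import re

import bs4 as bs
import requests


def emlgrb(x):
    email_set = set()
    for url in x:
        try:
            response = requests.get(url)
            soup = bs.BeautifulSoup(response.text, "lxml")
            # Collect everything in the page text that looks like an email address
            emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", soup.text, re.I))
            email_set.update(emails)
        except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
            # Skip URLs that are malformed or unreachable
            continue
    return email_set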

This function is supposed to be fed by another function, which creates a list of URLs. The feeder function:

def handle_local_links(url, link):
    if link.startswith("/"):
        return "".join([url, link])
    return link

def get_links(url):
    try:
        response = requests.get(url, timeout=5)
        soup = bs.BeautifulSoup(response.text, "lxml")
        body = soup.body
        links = [link.get("href") for link in body.find_all("a")]
        links = [handle_local_links(url, link) for link in links]
        links = [str(link.encode("ascii")) for link in links]
        return links

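It goes on with a number of except clauses, each returning an empty list if raised (not important). However, the return value of get_links() looks like this: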

["b'https://pythonprogramming.net/parsememcparseface//'"]

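Of course the list contains many more links (I can't post them for lack of reputation). The emlgrb() function cannot process this list (InvalidSchema: No connection adapters were found), but if I manually remove the b and the redundant quotes, so that the list looks like this: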

['https://pythonprogramming.net/parsememcparseface//']

emlgrb() works. Any suggestion about where the problem is, or how to write a "cleaning function" that turns the first list into the second, is welcome.

Thanks

1 answer:

Answer 0 (score: 0)

The solution is to drop .encode('ascii'):

def get_links(url):
    try:
        response = requests.get(url, timeout=5)
        soup = bs.BeautifulSoup(response.text, "lxml")
        body = soup.body
        links = [link.get("href") for link in body.find_all("a")]
        links = [handle_local_links(url, link) for link in links]
        links = [str(link) for link in links]
        return links

You can also pass an encoding to str(), as in this signature from pydoc: str(object=b'', encoding='utf-8', errors='strict')
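As a small illustration of that signature (the bytes value here is only sample data), passing an encoding makes str() decode the bytes instead of returning their repr, and bytes.decode() gives the same result:

```python
# Sample bytes value, used only for illustration.
raw = b"https://pythonprogramming.net/parsememcparseface//"

as_repr = str(raw)                    # no encoding given: falls back to repr(raw)
as_text = str(raw, encoding="utf-8")  # decodes the bytes into a normal str
also_text = raw.decode("utf-8")       # equivalent way to decode

print(as_repr)               # includes the b prefix and the inner quotes
print(as_text)               # the plain URL
print(as_text == also_text)  # both decoding routes agree
```

Only the decoded forms are usable as URLs for requests.get().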

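This is because str() calls .__str__() (or, failing that, .__repr__()) on the object, so if the object is bytes and no encoding is given, the output is "b'string'". In fact, that is what gets printed when you call print(bytes_obj). And calling .encode() on a str object creates a bytes object!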
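For a list that has already been produced in the "b'...'" form, a cleaning function along the lines the question asks for might look like the sketch below. The name clean_links is hypothetical, and the sketch assumes every damaged entry is a well-formed bytes literal containing only ASCII; ast.literal_eval is used so the string is parsed rather than executed.

```python
import ast


def clean_links(links):
    """Turn strings such as "b'https://example.com'" back into plain URLs.

    Hypothetical helper for illustration; not part of the original code.
    """
    cleaned = []
    for link in links:
        if link.startswith(("b'", 'b"')):
            # Parse the bytes literal safely, then decode it to str
            link = ast.literal_eval(link).decode("ascii")
        cleaned.append(link)
    return cleaned


# Reproduce the problem: str() applied to bytes keeps the b prefix and quotes
url = "https://pythonprogramming.net/parsememcparseface//"
broken = [str(url.encode("ascii"))]

print(broken)
print(clean_links(broken))
```

The damaged string starts with b' rather than http:// or https://, which is presumably why requests finds no connection adapter for it. Removing .encode('ascii') at the source remains the simpler fix; the helper is only useful for data that was already saved in the broken form.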