I am trying to extract email addresses from web pages. This is my email-grabbing function:
import re
import requests
import bs4 as bs

def emlgrb(x):
    email_set = set()
    for url in x:
        try:
            response = requests.get(url)
            soup = bs.BeautifulSoup(response.text, "lxml")
            emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", soup.text, re.I))
            email_set.update(emails)
        except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
            continue
    return email_set
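As a quick sanity check, the email regex from emlgrb() can be exercised on its own against a plain string (the sample addresses below are made up for illustration, not from the question):

```python
import re

# Same pattern as in emlgrb(); re.I makes it case-insensitive.
EMAIL_RE = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"

text = "Contact info@example.com or Sales.Team@Example.CO.UK today."
emails = set(re.findall(EMAIL_RE, text, re.I))
```

Because the pattern only uses character classes, it works on any str input; the failure described later happens before this point, when requests is handed a malformed URL.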
This function is supposed to be fed by another function that builds the list of URLs. The feeder functions:
def handle_local_links(url, link):
    if link.startswith("/"):
        return "".join([url, link])
    return link

def get_links(url):
    try:
        response = requests.get(url, timeout=5)
        soup = bs.BeautifulSoup(response.text, "lxml")
        body = soup.body
        links = [link.get("href") for link in body.find_all("a")]
        links = [handle_local_links(url, link) for link in links]
        links = [str(link.encode("ascii")) for link in links]
        return links
    except Exception:
        # (the real code catches several specific exceptions here
        # and returns an empty list; omitted as unimportant below)
        return []
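To make the behavior of handle_local_links() concrete, here is a small standalone demonstration (the base URL and the example paths are illustrative):

```python
def handle_local_links(url, link):
    # A site-relative link ("/...") is prefixed with the base URL;
    # anything else is assumed to already be absolute.
    if link.startswith("/"):
        return "".join([url, link])
    return link

base = "https://pythonprogramming.net"

joined = handle_local_links(base, "/parsememcparseface/")
# -> "https://pythonprogramming.net/parsememcparseface/"

untouched = handle_local_links(base, "https://example.com/page")
# -> "https://example.com/page" (absolute link, returned unchanged)
```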
The function continues with a number of except clauses that return an empty list if raised (not important here). However, the return value of get_links() looks like this:
["b'https://pythonprogramming.net/parsememcparseface//'"]
Of course there are many more links in the list (can't post them all - reputation). The emlgrb() function cannot process that list (InvalidSchema: No connection adapters were found), but if I manually delete the b and the redundant quotes - so the list looks like this:
['https://pythonprogramming.net/parsememcparseface//']
then emlgrb() works. Any suggestions about what the problem is - or how to write a "cleaning function" that turns the first list into the second - are welcome.
Thanks
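For what it's worth, the "cleaning function" asked about above can be sketched like this (clean_links is a name made up here; it assumes each bad entry is exactly the str() of an ASCII bytes literal, which is what str(link.encode("ascii")) produces):

```python
import ast

def clean_links(links):
    """Undo str(bytes): turn "b'https://...'" back into 'https://...'."""
    cleaned = []
    for link in links:
        if link.startswith(("b'", 'b"')):
            # literal_eval safely parses the bytes literal, decode yields str.
            link = ast.literal_eval(link).decode("ascii")
        cleaned.append(link)
    return cleaned

fixed = clean_links(["b'https://pythonprogramming.net/parsememcparseface//'"])
# -> ['https://pythonprogramming.net/parsememcparseface//']
```

That said, this only patches the symptom; dropping the encode step (as the answer below suggests) avoids creating the broken strings in the first place.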
Answer 0 (score: 0)
The solution is to drop .encode('ascii'):
def get_links(url):
    try:
        response = requests.get(url, timeout=5)
        soup = bs.BeautifulSoup(response.text, "lxml")
        body = soup.body
        links = [link.get("href") for link in body.find_all("a")]
        links = [handle_local_links(url, link) for link in links]
        links = [str(link) for link in links]
        return links
    except Exception:
        # (exception handlers from the question, omitted here)
        return []
You could also pass an encoding to str(), as in this pydoc signature: str(object=b'', encoding='utf-8', errors='strict').
The reason is that a bare str() call falls back to .__str__() or .__repr__() of the object, so if the object is bytes, the output is "b'string'". In fact, that is exactly what gets printed when you print(bytes_obj). And remember: calling .encode() on a str object creates a bytes object!
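The difference the answer describes can be seen directly in the interpreter (the URL is just an example value):

```python
# .encode() on a str produces a bytes object.
b = "https://pythonprogramming.net".encode("ascii")

# Bare str() on bytes gives the repr form, quotes and b-prefix included.
s1 = str(b)           # -> "b'https://pythonprogramming.net'"

# Passing an encoding actually decodes the bytes back to text.
s2 = str(b, "utf-8")  # -> "https://pythonprogramming.net"
```

This is why str(link.encode("ascii")) produced entries like "b'https://...'" in get_links(), while plain str(link) does not.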