I have a function that should receive a str value, but when I run it I get an error saying the value is bytes:
Traceback (most recent call last):
  File "C:\Users\sdand\Documents\Python\Engine\engine.py", line 4, in <module>
    print (find.crawl_web('https://google.com',4))
  File "C:\Users\sdand\Documents\Python\Engine\finder.py", line 68, in crawl_web
    links = self.get_all_links(content)
  File "C:\Users\sdand\Documents\Python\Engine\finder.py", line 20, in get_all_links
    url, endpos = self.get_next_target(page)
  File "C:\Users\sdand\Documents\Python\Engine\finder.py", line 7, in get_next_target
    start_link = s.find('<a href=')
TypeError: a bytes-like object is required, not 'str'
This is the function where I call get_all_links:
def crawl_web(self, seed, max_depth):
    tocrawl = [seed]
    crawled = []
    next_depth = []
    depth = 0
    index = []
    while tocrawl and depth <= max_depth:
        page = tocrawl.pop()
        if page not in crawled:
            # here content is a str
            content = self.get_page(page)
            self.add_page_to_index(index, page, content)
            links = self.get_all_links(content)
            self.union(next_depth, links)
            crawled.append(page)
        if not tocrawl:
            tocrawl, next_depth = next_depth, []
            depth = depth + 1
    return index
This is get_page:
def get_page(self, url):
    try:
        import urllib.request
        return urllib.request.urlopen(url).read()
    except:
        return ""
This is get_all_links:
def get_all_links(self, page):
    # but here it is bytes, I don't know why
    links = []
    while True:
        url, endpos = self.get_next_target(page)
        print(url)
        if url != None:
            links.append(url)
            page = page[endpos:]
        else:
            break
    return links
I don't understand why my str variable "content" shows up as bytes inside get_all_links. Can someone explain this, and how I can fix it?
Answer 0 (score: 1)
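You may not be aware that .read() returns a bytes object, not a str. Working with bytes objects is often recommended when scraping web pages, but the simplest fix here is to convert the result to str by decoding it: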
return urllib.request.urlopen(url).read().decode('utf-8')
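
To make the type mismatch concrete, here is a minimal sketch that reproduces the error without any network access and shows two ways around it. The sample HTML and the helper name `decode_body` are assumptions for illustration only, and UTF-8 is assumed as the fallback encoding. In real code the charset could probably be taken from `response.headers.get_content_charset()` instead of being hard-coded.

```python
from typing import Optional


def decode_body(raw: bytes, charset: Optional[str] = None) -> str:
    """Hypothetical helper: decode a response body, assuming UTF-8 if no charset is given."""
    # errors="replace" keeps the crawl going if a page contains invalid byte sequences.
    return raw.decode(charset or "utf-8", errors="replace")


# Stand-in for what urlopen(url).read() returns: bytes, not str (made-up sample page).
raw = b'<html><a href="https://example.com/a">A</a></html>'

# Searching bytes with a str pattern reproduces the TypeError from the traceback.
try:
    raw.find('<a href=')
    mixed_types_ok = True
except TypeError:
    mixed_types_ok = False

# Option 1: decode once in get_page, then use str patterns everywhere.
page = decode_body(raw)
start_str = page.find('<a href=')

# Option 2: keep the bytes and search with a bytes pattern instead.
start_bytes = raw.find(b'<a href=')
```

Note that get_page in the question returns "" (a str) when the request fails, so the function can hand back either type. Decoding inside get_page makes the return type a str in both cases.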