Why is the value bytes when I pass it as a str argument?

Asked: 2017-06-16 03:22:43

Tags: python python-3.x urllib

I have a function that expects a str value, but when I run it, the error says the value is bytes:

 Traceback (most recent call last):
  File "C:\Users\sdand\Documents\Python\Engine\engine.py", line 4, in <module>
    print (find.crawl_web('https://google.com',4))
  File "C:\Users\sdand\Documents\Python\Engine\finder.py", line 68, in crawl_web
    links = self.get_all_links(content)
  File "C:\Users\sdand\Documents\Python\Engine\finder.py", line 20, in get_all_links
    url, endpos = self.get_next_target(page)
  File "C:\Users\sdand\Documents\Python\Engine\finder.py", line 7, in get_next_target
    start_link = s.find('<a href=')
TypeError: a bytes-like object is required, not 'str'

This is the function from which I call get_all_links:

def crawl_web(self,seed, max_depth):
        tocrawl = [seed]
        crawled = []
        next_depth = []
        depth = 0
        index=[]

        while tocrawl and depth <= max_depth:
            page = tocrawl.pop()
            if page not in crawled:
                # here I expect content to be a str
                content = self.get_page(page)
                self.add_page_to_index(index,page,content)
                links = self.get_all_links(content)
                self.union(next_depth,links)
                crawled.append(page)
            if not tocrawl:
                tocrawl, next_depth = next_depth, []
                depth = depth+1
        return index

This is get_page:

def get_page(self,url):
        try:
            import urllib.request

            return  urllib.request.urlopen(url).read()
        except:
            return ""

This is get_all_links:

def get_all_links(self,page):
        # but here it is bytes, and I don't know why
        links=[]
        while True:
            url, endpos = self.get_next_target(page)
            print(url)
            if url is not None:
                links.append(url)
                page = page[endpos:]
            else:
                break
        return links

I don't understand why my str variable content shows up as bytes inside get_all_links. Can someone explain this to me, and tell me how to fix it?

1 Answer:

Answer 0: (score: 1)

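You may not be aware that .read() returns a bytes object, not a str. Nothing converts your variable inside get_all_links: content was already bytes when get_page returned it, and calling s.find('<a href=') on bytes with a str argument raises the TypeError in your traceback. Working with bytes is often recommended for web scraping, but the simplest fix here is to decode the result to str: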

return urllib.request.urlopen(url).read().decode('utf-8')
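
Note that hard-coding 'utf-8' can fail on pages served in another encoding. As a rough sketch only (decode_body is a hypothetical helper name, and falling back to UTF-8 with errors="replace" is my assumption, not something the question requires), get_page could use the charset the server declares. It is written as a plain function here; inside your class it would take self as well:

```python
import urllib.error
import urllib.request


def decode_body(raw, charset=None):
    """Turn raw response bytes into str (hypothetical helper)."""
    try:
        # Assumption: fall back to UTF-8 when no charset is declared.
        # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
        return raw.decode(charset or "utf-8", errors="replace")
    except LookupError:
        # The declared charset is not a codec Python knows about.
        return raw.decode("utf-8", errors="replace")


def get_page(url):
    """Return the page body as str, or "" if the page cannot be fetched."""
    try:
        with urllib.request.urlopen(url) as response:
            # Charset from the Content-Type header, or None if absent.
            charset = response.headers.get_content_charset()
            return decode_body(response.read(), charset)
    except (urllib.error.URLError, ValueError):
        # URLError covers network and HTTP errors; ValueError covers malformed URLs.
        return ""
```

Both the success path and the error path now return str, so page.find('<a href=') in get_next_target always receives a matching type.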