Web scraping with Python: extra characters in the output

Time: 2020-06-18 17:18:45

Tags: python python-3.x

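I'm new to Python and am learning web scraping from a Udemy course. I'm trying to scrape some output from a demo site. I do get results, but some characters appear as escape codes instead of being converted to regular text.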

#!/usr/bin/env python3.6
'''
webscraping html, webpage data.
1. get the authors of quotes on first page.
2. create a list of all the quotes on first page.
3. extract the top ten tags on the home page.
'''
import bs4, requests, urllib.request
from bs4 import BeautifulSoup

base_url = 'http://quotes.toscrape.com/'
with urllib.request.urlopen(base_url) as response:
    html = response.read()
    text = str(html)
    soup = BeautifulSoup(text, 'lxml')

def get_author():
    authors = set()
    for name in soup.select('.author'):
        authors.add(name.text)
    print(authors)

def get_quotes():
    quotes = []
    for quote in soup.select('.text'):
        quotes.append(quote.text)
    print(quotes)

def top_ten_tags():
    toptags = []
    for tags in soup.select('.tag-item'):
        toptags.append(tags.text)
    print(toptags)


get_author()
get_quotes()
top_ten_tags()

Output:

{'Albert Einstein', 'Jane Austen', 'Thomas A. Edison', 'J.K. Rowling', 'Andr\\xc3\\xa9 Gide', 'Steve Martin', 'Eleanor Roosevelt', 'Marilyn Monroe'}
['\\xe2\\x80\\x9cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\\xe2\\x80\\x9d', '\\xe2\\x80\\x9cIt is our choices, Harry, that show what we truly are, far more than our abilities.\\xe2\\x80\\x9d', '\\xe2\\x80\\x9cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\\xe2\\x80\\x9d', '\\xe2\\x80\\x9cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\\xe2\\x80\\x9d', "\\xe2\\x80\\x9cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\\xe2\\x80\\x9d", '\\xe2\\x80\\x9cTry not to become a man of success. Rather become a man of value.\\xe2\\x80\\x9d', '\\xe2\\x80\\x9cIt is better to be hated for what you are than to be loved for what you are not.\\xe2\\x80\\x9d', "\\xe2\\x80\\x9cI have not failed. I've just found 10,000 ways that won't work.\\xe2\\x80\\x9d", "\\xe2\\x80\\x9cA woman is like a tea bag; you never know how strong it is until it's in hot water.\\xe2\\x80\\x9d", '\\xe2\\x80\\x9cA day without sunshine is like, you know, night.\\xe2\\x80\\x9d']
['\\n            love\\n            ', '\\n            inspirational\\n            ', '\\n            life\\n            ', '\\n            humor\\n            ', '\\n            books\\n            ', '\\n            reading\\n            ', '\\n            friendship\\n            ', '\\n            friends\\n            ', '\\n            truth\\n            ', '\\n            simile\\n            ']

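As you can see, the set of authors should contain the name "André Gide", with the accent, but for some reason Python does not print that character. For the second list (the quotes), it prints escape codes that I don't understand. Can someone tell me what I'm doing wrong?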

1 answer:

Answer 0 (score: 1)

Your problem is that you are coercing the HTML bytes to a string with str() instead of decoding them properly:

text = html.decode("utf8")
soup = BeautifulSoup(text, 'lxml')
get_author()
#{... 'André Gide', 'Eleanor Roosevelt'...}
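
To see why this matters without hitting the network, here is a minimal sketch that compares the two conversions on a small UTF-8 byte string. The variable names (raw, as_repr, as_text) and the sample text are made up for illustration; raw simply stands in for what response.read() returns.

```python
# Hypothetical stand-in for response.read(): UTF-8 encoded bytes.
raw = 'Andr\u00e9 Gide said \u201chello\u201d'.encode('utf-8')

# str() on a bytes object returns its repr, including the b'...' wrapper,
# so every non-ASCII byte shows up as a literal \xNN escape sequence.
as_repr = str(raw)

# decode() interprets the bytes as UTF-8 and yields the real characters.
as_text = raw.decode('utf-8')

print(as_repr)
print(as_text)
```

The first print shows the same kind of \xc3\xa9 and \xe2\x80\x9c sequences seen in the question's output, while the second shows the accented letter and the curly quotes. The literal \n runs in the tag list likely have the same cause, since str() also turns newline bytes into the two characters \ and n.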