Python Unicode characters from requests / bs4

Time: 2018-01-06 19:18:54

Tags: python unicode beautifulsoup python-requests

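I have a script that uses requests and bs4 to fetch a song's lyrics from MetroLyrics.

The problem is that when I print the result, it shows something like this (part of the lyrics):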

Rabbi, Papa, Allah, Lama, Imam, Bibbia, Dharma, Sura, Torah, Pane, Vino, Kashèr, ḤalÄl, Yom Kippur, Quaresima, Ramadan

It should look like this:

Rabbi, Papa, Lama, Imam, Bibbia, Dharma, Sura, Torah, Pane, vino, kashèr, ḥalāl, Yom Kippur, Quaresima, Ramadan

The code I am using:

import requests
from bs4 import BeautifulSoup
import os

try:
    from urllib.parse import quote_plus
except ImportError:
    from urllib import quote_plus

def get_lyrics(song_name):
    song_name += ' metrolyrics'
    name = quote_plus(song_name)
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 '
           '(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}

    url = 'http://www.google.com/search?q=' + name

    result = requests.get(url, headers=hdr).text
    link_start = result.find('http://www.metrolyrics.com')

    if(link_start == -1):
        return("Lyrics not found on Metrolyrics")

    link_end = result.find('html', link_start + 1)
    link = result[link_start:link_end + 4]


    lyrics_html = requests.get(link, headers={
                               'User-Agent': 'Mozilla/5.0 (Macintosh; Intel '
                               'Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, '
                               'like Gecko) Chrome/55.0.2883.95 Safari/537.36'
                               }
                               ).text

    soup = BeautifulSoup(lyrics_html, "lxml")
    raw_lyrics = soup.findAll('p', attrs={'class': 'verse'})
    try:
        final_lyrics = unicode.join(u'\n', map(unicode, raw_lyrics))
    except NameError:
        final_lyrics = str.join(u'\n', map(str, raw_lyrics))

    final_lyrics = (final_lyrics.replace('<p class="verse">', '\n'))
    final_lyrics = (final_lyrics.replace('<br/>', ' '))
    final_lyrics = final_lyrics.replace('</p>', ' ')
    return (final_lyrics)

I tried .encode('utf-8') and .encode('unicode-escape') and then converting back again, but neither fixed it.

I have another script that uses the musixmatch API, and Unicode displays correctly there.

1 answer:

Answer 0 (score: 1):

I made a small change in the get_lyrics function:

return final_lyrics.encode('latin1').decode('utf-8')

and got the desired output:

# python2
print get_lyrics('kashèr')
...
Rabbi, Papa, Allah, Lama, Imam, Bibbia, Dharma, Sura, Torah, Pane, Vino, Kashèr, Ḥalāl, Yom Kippur, Quaresima, Ramadan
...
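
The thread does not say why this round trip works. A likely explanation is that the page was served as UTF-8 but decoded as ISO-8859-1 (latin1): requests falls back to that charset when a text response declares none. The Python 3 sketch below reproduces the effect offline; the helper names garble and repair are made up for illustration and do not appear in the original script.

```python
# -*- coding: utf-8 -*-
# Python 3 sketch: reproduce the mojibake and undo it.
# garble() and repair() are hypothetical helpers, not part of the original code.


def garble(text):
    """Simulate UTF-8 bytes being wrongly decoded as ISO-8859-1 (latin1)."""
    return text.encode('utf-8').decode('latin1')


def repair(text):
    """Undo the wrong decode: recover the raw bytes, then decode them as UTF-8."""
    return text.encode('latin1').decode('utf-8')


original = 'Kashèr, Ḥalāl'
broken = garble(original)
fixed = repair(broken)

# Encoding the garbled string as UTF-8 only round-trips the wrong characters,
# which is why .encode('utf-8') alone did not help.
unchanged = broken.encode('utf-8').decode('utf-8')

print(repr(broken))
print(fixed)
```

If this is indeed the cause, it may be cleaner to avoid the wrong decode in the first place: either set response.encoding (for example to response.apparent_encoding) before reading response.text, or pass the raw bytes in response.content to BeautifulSoup and let it detect the encoding itself.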