encode() doesn't work in all cases

Time: 2017-07-20 18:14:03

Tags: python regex unicode encode

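I'm using Beautiful Soup 4 to scan HTML files and extract certain features. Specifically, I use it to find football players' names, clubs, leagues, statistics, and so on. Since many player and club names contain accent marks, I'm looking for a way to print those accents instead of seeing output like "Kak\xe1". I was able to get this working with:
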
# open html page
fut_page = open('futhead1.html','r')
# read content from html page
fut_read = fut_page.read()
# html parsed page
fut_soup = BeautifulSoup(fut_read, "html.parser")
# grabs all players
players = fut_soup.findAll('li',{'class':'list-group-item list-group-table-row player-group-item dark-hover'})
player = players[2]
# name_tag contains tag with player's name
name_tag = player.find("span",{"class":"player-name"})
# extract just the player's name
player_name = name_tag.text
print player_name.encode('utf-8')

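This prints the correct player name: "Kaká". However, when I use a regular expression to extract the club name, I don't see the same result, for example: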

regex_club = re.compile(ur'\[.*?</strong>\\n\s+\|\s\\n\s+(.*?)\\n', re.MULTILINE)
# extract club name
player_club = re.match(regex_club, str(pos_clb_lge_tag))
print player_club.group(1).encode('utf-8')

This code does print the right club name, e.g. "Atl\xe9tico Madrid", but encode() doesn't remove the "\xe9" and replace it with "é".

Below is the HTML I'm applying the regex to:

<li class="list-group-item list-group-table-row player-group-item dark-hover">
<div class="content player-item font-24">
    <a class="display-block padding-0" href="/fifa-mobile/17/players/33194/jan-oblak/">
        <span class="player-rating stream-col-50 text-center">
            <span class="revision-gradient shadowed font-12 fut elite">100</span>
        </span>
        <span class="player-info">
            <img class="player-image" src="http://futhead.cursecdn.com/static/img/fm/17/players/200389_SASC.png">
            <img class="player-program" src="http://futhead.cursecdn.com/static/img/fm/17/resources/program_17_VSATTACK.png">
            <span class="player-name">Jan Oblak</span>
            <span class="player-club-league-name">
                <strong>GK</strong>
                 | 
                Atlético Madrid
                 | 
                LaLiga Santander
            </span>
        </span>

        <span class="player-right text-center hidden-xs">
            <span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">83</span><span class="hover-label">PAC</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">50</span><span class="hover-label">SHO</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">66</span><span class="hover-label">PAS</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">55</span><span class="hover-label">DRI</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">58</span><span class="hover-label">DEF</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">85</span><span class="hover-label">PHY</span></span><span class="player-stat stream-col-60 font-12 font-medium text-upper">35</span>
        </span>
        <span class="player-right slide hidden-sm hidden-xs" data-direction="right" data-max="-482px">
            <span class="slide-content text-upper">
                <span class="trigger icon icon-dots-three-horizontal"></span>


                <span class="player-stat stream-col-80">
                    <span class="value">+2</span>
                    <span class="hover-label">MRK</span>
                </span>


                <span class="player-stat stream-col-80">
                    <span class="value">+1</span>
                    <span class="hover-label">OVR</span>
                </span>

                <span class="player-stat stream-col-100"><span class="value">right</span><span class="hover-label">Strong Foot</span></span>
                <span class="player-stat stream-col-100"><span class="value">18<span class="icon icon-star gold margin-l-4"></span></span><span class="hover-label">Weak Foot</span></span>
            </span>
        </span>

    </a>
</div>

Basically, why doesn't encode() work when I use a regex as an intermediate step? Let me know if further clarification is needed. Thanks.

1 answer:

Answer 0 (score: 0)

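I suspect you aren't showing all of your code (see [mcve]), but calling str() on a Unicode object that contains non-ASCII characters is wrong and should give: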

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 40: ordinal not in range(128)
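
The error above comes from an implicit ASCII encode. Here is a minimal sketch of the same failure, written in Python 3 syntax (an assumption on my part; the question uses Python 2, where str() on a Unicode object performs this encode implicitly). The sample string is taken from the question's HTML.

```python
# Python 3 sketch: an explicit ASCII encode fails the same way that
# Python 2's implicit str(unicode_object) conversion does.
club = 'Atl\xe9tico Madrid'  # sample value from the question's HTML

try:
    club.encode('ascii')
    message = None
except UnicodeEncodeError as exc:
    # Keep the error text so it can be inspected or printed.
    message = str(exc)

print(message)

# Encoding with a codec that covers the character succeeds.
utf8_bytes = club.encode('utf-8')
print(utf8_bytes)
```

The point is only that ASCII cannot represent "é", so any conversion that silently falls back to ASCII has to fail or be worked around.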

I suspect you've picked up the setdefaultencoding bad habit.

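What str() did here was turn the Unicode string into a byte string containing the literal text of the escape codes, e.g. '\\n' (two characters) instead of '\n' (one character), and it did the same for the non-ASCII characters.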
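
One way such escape text can arise, offered as a guess rather than a diagnosis: the question's regex starts with `\[`, which hints that a list (for example a tag's `.contents`) was passed to str(). Stringifying a list uses the repr of each element. The sketch below is in Python 3 syntax, where `ascii()` approximates what Python 2's repr produced; the sample text is hypothetical.

```python
# Hypothetical text node, shortened from the question's HTML.
text = '\n | \nAtl\xe9tico Madrid\n'

# str() of a list shows the repr of each element, so the newline becomes
# the two characters backslash + n.
as_list_text = str([text])
print(as_list_text)

# ascii() also escapes non-ASCII characters, similar to Python 2's repr.
ascii_text = ascii(text)
print(ascii_text)

# Once the escape is literal text, encoding cannot turn it back into "é":
# backslash, x, e, 9 are all plain ASCII characters.
escaped = 'Atl\\xe9tico'
still_escaped = escaped.encode('utf-8')
print(still_escaped)
```

If this is what happened, the fix is to run the regex on the Unicode text itself rather than on a stringified container.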

If your terminal is configured correctly, you also shouldn't have to manually encode the final result when printing.
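
Whether printing needs a manual encode depends on the encoding the output stream reports. A small check, in Python 3 syntax, might look like the following; the helper name `can_encode` is made up for illustration.

```python
import sys

def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
    except (UnicodeEncodeError, LookupError):
        return False
    return True

# The stream may not report an encoding (e.g. when redirected), so fall
# back to ASCII as a conservative assumption.
stdout_encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'
print(stdout_encoding, can_encode('Atl\xe9tico Madrid', stdout_encoding))
```

If this reports False for your terminal, fixing the terminal or locale configuration is usually a better remedy than encoding by hand before every print.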

Here is a working example that uses BeautifulSoup to retrieve only the text to be parsed:

from bs4 import BeautifulSoup
import re

# open html page
fut_page = open('futhead1.html','r')
# read content from html page
fut_read = fut_page.read()
# html parsed page
fut_soup = BeautifulSoup(fut_read, "html.parser")
# grabs all players
players = fut_soup.findAll('li',{'class':'list-group-item list-group-table-row player-group-item dark-hover'})
player = players[0]
# name_tag contains the span with the player's position, club and league
name_tag = player.find("span",{"class":"player-club-league-name"})
# the last child is the text node after <strong>: " | club | league"
pos_clb_lge_tag = name_tag.contents[-1]
regex_club = re.compile(ur'\n\s+\|\s\n\s+(.*?)\n')
# extract club name
player_club = regex_club.match(pos_clb_lge_tag)
print player_club.group(1)
  

Atlético Madrid
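
To try the regex without the HTML file, the sketch below applies the same pattern to a literal copy of the text node. It is written in Python 3 syntax (so `r'...'` instead of `ur'...'`), and the `sample_text` value is retyped from the question's HTML, so the exact whitespace is an assumption. It also shows a split-based alternative that is not part of the original answer.

```python
import re

# Text node that follows <strong>GK</strong> in the question's HTML
# (whitespace retyped by hand, so treat it as approximate).
sample_text = (
    '\n                 | \n'
    '                Atl\xe9tico Madrid\n'
    '                 | \n'
    '                LaLiga Santander\n'
    '            '
)

# Same pattern as in the answer, as a Python 3 raw string.
regex_club = re.compile(r'\n\s+\|\s\n\s+(.*?)\n')
match = regex_club.match(sample_text)
club = match.group(1) if match else None
print(club)

# Alternative without a regex: split on the separator and strip whitespace.
parts = [part.strip() for part in sample_text.split('|') if part.strip()]
print(parts)
```

Both approaches work on the Unicode text directly, so no escape sequences ever enter the string being searched.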