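I am using Beautiful Soup 4 to scan HTML files and extract certain features. Specifically, I use it to find the names of football players, clubs, leagues, statistics, and so on. Since many player and club names contain accent marks, I am looking for a way to print those accented characters instead of seeing output like "Kak\xe1". I was able to get it working with:
# open html page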
fut_page = open('futhead1.html','r')
# read content from html page
fut_read = fut_page.read()
# html parsed page
fut_soup = BeautifulSoup(fut_read, "html.parser")
# grabs all players
players = fut_soup.findAll('li',{'class':'list-group-item list-group-table-row player-group-item dark-hover'})
player = players[2]
# name_tag contains tag with player's name
name_tag = player.find("span",{"class":"player-name"})
# extract just the player's name
player_name = name_tag.text
print player_name.encode('utf-8')
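This prints the correct player name, "Kaká". However, I do not see the same result when I use a regular expression to extract the club name, for example: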
regex_club = re.compile(ur'\[.*?</strong>\\n\s+\|\s\\n\s+(.*?)\\n', re.MULTILINE)
# extract club name
player_club = re.match(regex_club, str(pos_clb_lge_tag))
print player_club.group(1).encode('utf-8')
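This code does print the right club name, e.g. "Atl\xe9tico Madrid", but encode() does not remove the "\xe9" and replace it with "é".
Below is the HTML file I am applying the regular expression to:
<li class="list-group-item list-group-table-row player-group-item dark-hover">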
<div class="content player-item font-24">
<a class="display-block padding-0" href="/fifa-mobile/17/players/33194/jan-oblak/">
<span class="player-rating stream-col-50 text-center">
<span class="revision-gradient shadowed font-12 fut elite">100</span>
</span>
<span class="player-info">
<img class="player-image" src="http://futhead.cursecdn.com/static/img/fm/17/players/200389_SASC.png">
<img class="player-program" src="http://futhead.cursecdn.com/static/img/fm/17/resources/program_17_VSATTACK.png">
<span class="player-name">Jan Oblak</span>
<span class="player-club-league-name">
<strong>GK</strong>
|
Atlético Madrid
|
LaLiga Santander
</span>
</span>
<span class="player-right text-center hidden-xs">
<span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">83</span><span class="hover-label">PAC</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">50</span><span class="hover-label">SHO</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">66</span><span class="hover-label">PAS</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">55</span><span class="hover-label">DRI</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">58</span><span class="hover-label">DEF</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">85</span><span class="hover-label">PHY</span></span><span class="player-stat stream-col-60 font-12 font-medium text-upper">35</span>
</span>
<span class="player-right slide hidden-sm hidden-xs" data-direction="right" data-max="-482px">
<span class="slide-content text-upper">
<span class="trigger icon icon-dots-three-horizontal"></span>
<span class="player-stat stream-col-80">
<span class="value">+2</span>
<span class="hover-label">MRK</span>
</span>
<span class="player-stat stream-col-80">
<span class="value">+1</span>
<span class="hover-label">OVR</span>
</span>
<span class="player-stat stream-col-100"><span class="value">right</span><span class="hover-label">Strong Foot</span></span>
<span class="player-stat stream-col-100"><span class="value">18<span class="icon icon-star gold margin-l-4"></span></span><span class="hover-label">Weak Foot</span></span>
</span>
</span>
</a>
</div>
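Basically, why does encode() not work when I use a regular expression as an intermediate step? Let me know if further clarification is needed. Thanks.
Answer 0 (score: 0)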
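I suspect you aren't showing all of your code (see [mcve]), but calling str on a Unicode object is wrong and should give:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 40: ordinal not in range(128)
I suspect you have picked up the setdefaultencoding bad habit. What str() did was convert the Unicode string into a byte string containing the text of the escape codes, such as '\\n' (two characters) instead of '\n' (one character), and it does the same for non-ASCII characters. Encoding that byte string afterwards cannot turn the literal text "\xe9" back into "é".
If your terminal is configured correctly, you also should not have to manually encode the final result when printing.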
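The difference between escape text and real characters can be reproduced in a few lines. This is a minimal sketch, not the asker's code: it assumes Python 3, where `ascii()` produces the same kind of escaped text that Python 2 produced when stringifying a container of Unicode strings. The variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Assumption: Python 3. ascii() stands in for the Python 2 behaviour
# discussed above; the names below are illustrative only.
club = u"Atl\u00e9tico Madrid"

# Stringifying a container escapes its elements: the result holds the
# four characters backslash, x, e, 9 rather than the single character é.
escaped = ascii([club, u"\n"])
print(escaped)

# A real newline is one character; its escaped form is four
# (quote, backslash, n, quote).
print(len(u"\n"), len(ascii(u"\n")))

# Encoding the escaped text changes nothing, because it is already pure ASCII.
round_trip = escaped.encode("utf-8").decode("utf-8")
print(round_trip == escaped)
```

Once the accented character has been replaced by escape text, `encode('utf-8')` has nothing left to convert, which matches the symptom in the question.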
Here is a working example that uses BeautifulSoup to retrieve only the text to be parsed:
from bs4 import BeautifulSoup
import re
# open html page
fut_page = open('futhead1.html','r')
# read content from html page
fut_read = fut_page.read()
# html parsed page
fut_soup = BeautifulSoup(fut_read, "html.parser")
# grabs all players
players = fut_soup.findAll('li',{'class':'list-group-item list-group-table-row player-group-item dark-hover'})
player = players[0]
# name_tag contains the tag with the position, club and league
name_tag = player.find("span",{"class":"player-club-league-name"})
# take the last child node: the Unicode text that follows <strong>...</strong>
pos_clb_lge_tag = name_tag.contents[-1]
regex_club = re.compile(ur'\n\s+\|\s\n\s+(.*?)\n')
# extract club name
player_club = regex_club.match(pos_clb_lge_tag)
print player_club.group(1)
Atlético Madrid
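If the regular expression is not essential, the same text node can probably be taken apart with plain string methods. The following is only a sketch under assumptions: `split_position_text` is a hypothetical helper, the sample string merely imitates what `name_tag.contents[-1]` might hold for the markup in the question, and the code is written for Python 3.

```python
# -*- coding: utf-8 -*-
# Assumption: the text has already been pulled out of the tag as Unicode.
# split_position_text is a hypothetical helper, not part of BeautifulSoup.

def split_position_text(text):
    """Split '| club | league'-style text into stripped, non-empty fields."""
    return [part.strip() for part in text.split(u"|") if part.strip()]

# Sample shaped like the text that follows <strong>GK</strong> in the question.
sample = (
    u"\n        |\n        Atl\u00e9tico Madrid"
    u"\n        |\n        LaLiga Santander\n    "
)

# The text stays Unicode throughout, so no escape codes are introduced.
club, league = split_position_text(sample)
```

Because the value never passes through `str()` on a container, `club` keeps its accented character and can be printed directly on a terminal that is configured for UTF-8.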