Question

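I'm trying to write a small web scraper in Python, and I think I've run into an encoding problem. I'm trying to scrape http://www.resident-music.com/tickets (specifically the table on the page). A row might look like this: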

    <tr>
        <td style="width:64.9%;height:11px;">
         <p><strong>the great escape 2017&nbsp; local early bird tickets, selling fast</strong></p>
        </td>
        <td style="width:13.1%;height:11px;">
         <p><strong>18<sup>th</sup>&ndash; 20<sup>th</sup> may</strong></p>
        </td>
        <td style="width:15.42%;height:11px;">
         <p><strong>various</strong></p>
        </td>
        <td style="width:6.58%;height:11px;">
         <p><strong>&pound;55.00</strong></p>
        </td>
       </tr>

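I'm essentially trying to replace &pound;55.00 with £55.00, and to get rid of any other 'non-text' nasties in the same way.

I've tried a few of the different encoding approaches you can use with BeautifulSoup and urllib2, to no avail. I think I'm just doing it wrong.

Thanks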

Answer 1

You want to unescape the HTML, which you can do in Python 3 with html.unescape:

In [14]: from html import unescape

In [15]: h = """<tr>
   ....:         <td style="width:64.9%;height:11px;">
   ....:          <p><strong>the great escape 2017&nbsp; local early bird tickets, selling fast</strong></p>
   ....:         </td>
   ....:         <td style="width:13.1%;height:11px;">
   ....:          <p><strong>18<sup>th</sup>&ndash; 20<sup>th</sup> may</strong></p>
   ....:         </td>
   ....:         <td style="width:15.42%;height:11px;">
   ....:          <p><strong>various</strong></p>
   ....:         </td>
   ....:         <td style="width:6.58%;height:11px;">
   ....:          <p><strong>&pound;55.00</strong></p>
   ....:         </td>
   ....:        </tr>"""

In [16]: print(unescape(h))
<tr>
        <td style="width:64.9%;height:11px;">
         <p><strong>the great escape 2017  local early bird tickets, selling fast</strong></p>
        </td>
        <td style="width:13.1%;height:11px;">
         <p><strong>18<sup>th</sup>– 20<sup>th</sup> may</strong></p>
        </td>
        <td style="width:15.42%;height:11px;">
         <p><strong>various</strong></p>
        </td>
        <td style="width:6.58%;height:11px;">
         <p><strong>£55.00</strong></p>
        </td>
       </tr>
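
Note that unescape turns &nbsp; into a non-breaking space (U+00A0) rather than an ordinary space, which is why the first cell above still appears to contain two spaces. A minimal sketch, assuming you want plain single spaces in the final text; clean_cell is a hypothetical helper name, not part of any library:

```python
from html import unescape


def clean_cell(raw):
    """Unescape HTML entities, then normalise whitespace to single plain spaces."""
    text = unescape(raw)
    # &nbsp; is now U+00A0; swap it for an ordinary space, then collapse whitespace runs.
    return " ".join(text.replace("\u00a0", " ").split())


print(clean_cell("the great escape 2017&nbsp; local early bird tickets"))
print(clean_cell("&pound;55.00"))
```

This only touches the text you pass in, so it is best applied to individual cell contents rather than to the whole page.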

For Python 2 use:

In [6]: from HTMLParser import HTMLParser

In [7]: unescape = HTMLParser().unescape  

In [8]: print(unescape(h))
<tr>
        <td style="width:64.9%;height:11px;">
         <p><strong>the great escape 2017  local early bird tickets, selling fast</strong></p>
        </td>
        <td style="width:13.1%;height:11px;">
         <p><strong>18<sup>th</sup>– 20<sup>th</sup> may</strong></p>
        </td>
        <td style="width:15.42%;height:11px;">
         <p><strong>various</strong></p>
        </td>
        <td style="width:6.58%;height:11px;">
         <p><strong>£55.00</strong></p>
        </td>
       </tr>

You can see that all of the entities are unescaped correctly, not just the pound sign.
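
If the same script has to run under both interpreters, the two approaches can be combined behind one name. A minimal sketch, assuming Python 3.4+ or Python 2.7:

```python
try:
    # Python 3.4+: unescape lives in the html module.
    from html import unescape
except ImportError:
    # Python 2 fallback: use the unescape method of an HTMLParser instance.
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

print(unescape("&pound;55.00"))
```

Either way the rest of the code can simply call unescape(...) without caring which interpreter is running.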

Answer 2

I used requests, but hopefully you can do the same with urllib2. Here is the code:

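#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 script; uses BeautifulSoup 3 (the `BeautifulSoup` package).

import requests
from BeautifulSoup import BeautifulSoup

# Fetch and parse the page; 'your_url' is a placeholder.
soup = BeautifulSoup(requests.get('your_url').text)
# Collect every table row on the page.
chart = soup.findAll(name='tr')
# Replace the '&pound;' entity with the '£' character (U+00A3).
print str(chart).replace('&pound;', unichr(163))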

Now you should get the expected output!

Sample output:

...
<strong>£71.50</strong></p>
...

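Anyway, there are many ways to do the parsing. The interesting part here was print str(chart).replace('&pound;',unichr(163)), which was quite a challenge :)

Update

If you want to unescape more than one (or even just one) character, such as dashes, pound signs and so on, it is easier and more efficient to use a parser, as in Padraic's answer. As you will also read in the comments, it sometimes takes care of other encoding issues as well.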
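
For what it's worth, a parser-based version on Python 3 might look like the sketch below. It uses only the standard library's html.parser rather than BeautifulSoup; the class name CellTextParser and the sample string are made up for illustration, and it assumes a simple table with no nested tables:

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of each <td>, grouped by <tr>, with entities decoded."""

    def __init__(self):
        # convert_charrefs=True decodes &pound;, &ndash; etc. (the default since Python 3.5).
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._cell = None  # list of text chunks while inside a <td>, otherwise None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Join the chunks and collapse whitespace (including non-breaking spaces).
            text = " ".join("".join(self._cell).split())
            if self.rows:  # cells outside any <tr> are dropped
                self.rows[-1].append(text)
            self._cell = None


sample = (
    "<tr><td><p><strong>18<sup>th</sup>&ndash; 20<sup>th</sup> may</strong></p></td>"
    "<td><p><strong>&pound;55.00</strong></p></td></tr>"
)
parser = CellTextParser()
parser.feed(sample)
parser.close()
print(parser.rows)
```

Every entity is decoded by the parser itself, so there is no need to chain one .replace() call per character.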

&pound; displaying in urllib2 and Beautiful Soup