Question

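I'm trying to scrape NBA player statistics so I can do some machine learning on them, and I found these 'printable player files' that contain a bunch of good stats. Unfortunately, when I use BeautifulSoup to parse the HTML, it doesn't work at all. For example: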

from bs4 import BeautifulSoup
import codecs
import urllib2

url = 'http://www.nba.com/playerfile/ray_allen/printable_player_files.html'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)

with open('ray_allen.txt', 'w') as f:
    f.write(soup.prettify())
    f.close()

gives me a file that looks like this:

<html>
 <head>
  <!--no description was found-->
  <!--no title was found-->
  <!--no keywords found-->
  <!--not article-->
  <script>
   var site = "nba";
var page = "player";
  </script>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <script language="Javascript">
   &lt;!--
var flashinstalled = 0;
var flashversion = 0;
MSDetect = "false";
if (navigator.plugins &amp;&amp; navigator.plugins.length) {
    x = navigator.plugins["Shockwave Flash"];
    if (x) {
        flashinstalle   d       =       2   ;   

           i   f       (   x   .   d   e   s   c   r   i   p   t   i   o   n   )       {   

               y       =       x   .   d   e   s   c   r   i   p   t   i   o   n   ;   

               f   l   a   s   h   v   e   r   s   i   o   n       =       y   .   c   h   a   r   A   t   (   y   .   i   n   d   e   x   O   f   (   '   .   '   )   -   1   )   ;   

           }   

       }       e   l   s   e   

           f   l   a   s   h   i   n   s   t   a   l   l   e   d       =       1   ;   

       i   f       (   n   a   v   i   g   a   t   o   r   .   p   l   u   g   i   n   s   [   "   S   h   o   c   k   w   a   v   e       F   l   a   s   h       2   .   0   "   ]   )       {   

           f   l   a   s   h   i   n   s   t   a   l   l   e   d       =       2   ;   

           f   l   a   s   h   v   e   r   s   i   o   n       =       2   ;   

       }   
[...]

and then it continues for another 3000+ lines before finishing (the [...] markers were added by me):

[...]
   &lt;   /   b   o   d   y   &gt;   

   &lt;   /   h   t   m   l   &gt;
  </script>
 </head>
</html>

I also tried 'http://www.basketball-reference.com/players/a/allenra02.html', and that one gave me this error:

Traceback (most recent call last):
  File "test.py", line 9, in <module>
    f.write(soup.prettify())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb7' in position 6167: ordinal not in range(128)

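Maybe I should use something else to parse the HTML? Or is one of these problems easy to fix? What I've read here seems to suggest that BeautifulSoup should make things easy, not hard!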

Edit: the line:

print soup.prettify()

works for the second page in the terminal, so something goes wrong when writing to the file; that part is not BeautifulSoup's problem.

Answer 1

This has the same symptoms as bug 972466, which was fixed in 4.0.3. I suggest upgrading to the latest version of Beautiful Soup 4.
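To check whether an installed copy already includes that fix, a minimal sketch along these lines may help. The helper names parse_version and has_fix are made up for illustration; the only library detail relied on is the bs4.__version__ string, and the 4.0.3 threshold is taken from the sentence above.

```python
import re

# Release in which the bug is reported as fixed (from the answer above).
FIXED_IN = (4, 0, 3)


def parse_version(text):
    """Turn '4.0.3' or '4.0.0b10' into a 3-tuple of ints (hypothetical helper)."""
    parts = []
    for piece in text.split('.'):
        match = re.match(r'\d+', piece)
        if not match:
            break
        parts.append(int(match.group()))
    # Pad short versions such as '4.1' so tuple comparison behaves sensibly.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def has_fix(version_text):
    """True if the given version string is at or past FIXED_IN."""
    return parse_version(version_text) >= FIXED_IN


if __name__ == '__main__':
    try:
        import bs4
    except ImportError:
        print('bs4 is not installed')
    else:
        print('bs4 %s, includes the fix: %s'
              % (bs4.__version__, has_fix(bs4.__version__)))
```

If has_fix reports False, upgrading (for example with pip install --upgrade beautifulsoup4) is the route this answer recommends.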

Answer 2

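This looks like a bug in BeautifulSoup 4.

I tried your code with BeautifulSoup 3 (as packaged in Ubuntu) by changing from bs4 import BeautifulSoup to from BeautifulSoup import BeautifulSoup, and it worked as expected. When I used v4 (running your code unchanged), I reproduced your problem. The bug appears to be in the parser rather than in prettify, because printing the soup object shows the same problem.

Please file it as a bug at https://bugs.launchpad.net/beautifulsoup/. In the meantime, use version 3.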
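As for the UnicodeEncodeError on the second page, which the question's edit traces to the file write rather than to parsing: one possible workaround, assuming soup.prettify() is returning a unicode string (as it does in bs4 when no encoding is passed), is to open the output file with an explicit encoding. The sketch below uses a stand-in string instead of real page content, and write_unicode and read_unicode are illustrative names, not part of any library.

```python
import io
import os
import tempfile


def write_unicode(path, text, encoding='utf-8'):
    """Write a unicode string to path, encoding it explicitly."""
    # io.open behaves the same on Python 2 and 3 and encodes on write,
    # so no implicit 'ascii' conversion takes place.
    with io.open(path, 'w', encoding=encoding) as f:
        f.write(text)


def read_unicode(path, encoding='utf-8'):
    """Read the file back as a unicode string."""
    with io.open(path, 'r', encoding=encoding) as f:
        return f.read()


# Stand-in for soup.prettify(); u'\xb7' is the character from the traceback.
sample = u'<p>PTS \xb7 REB</p>'
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
write_unicode(path, sample)
round_trip = read_unicode(path)
```

In the question's script this would amount to calling write_unicode('ray_allen.txt', soup.prettify()) in place of the plain open(...) block.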

BeautifulSoup is not reading the document correctly
