BeautifulSoup truncates table

Date: 2015-04-24 21:07:00

Tags: python character-encoding web-scraping beautifulsoup

I am trying to write a Python script to process all of the joyo kanji. However, my script only gets the first 504 elements of the table; the full table has 2,136 elements. This script demonstrates the problem:

from bs4 import BeautifulSoup 
from urllib2 import urlopen

url = "http://en.wikipedia.org/wiki/List_of_j%C5%8Dy%C5%8D_kanji"
soup = BeautifulSoup(urlopen(url))

print soup.prettify()
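
As a rough check, the number of parsed rows can be counted (a sketch; the page contains several tables, so this tallies rows from all of them and is only an indicator of where parsing stops):

from bs4 import BeautifulSoup
from urllib2 import urlopen

url = "http://en.wikipedia.org/wiki/List_of_j%C5%8Dy%C5%8D_kanji"
soup = BeautifulSoup(urlopen(url))

# Count every <tr> the parser produced; with the truncated parse this
# lands far below the ~2,136 rows the kanji table alone should contribute.
print "parsed %d <tr> elements" % len(soup.find_all("tr"))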

The last element shown in the table is:

   <tr>
   <td>
    504
   </td>
   <td style="font-size:2em">
    <a href="//">
    </a>
   </td>
  </tr>
 </table>

However, when I view the table in Chrome, I see element 504 as:

<tr>
<td>504</td>
<td style="font-size:2em">
<a href="//en.wiktionary.org/wiki/%E6%BF%80" class="extiw" title="wikt:激">激</a>
</td>
...

I expect the last element of the table to be element 2,136.

1 Answer:

Answer 0 (score: 1)

It looks like you have a broken version of lxml or libxml2 (the actual C library that does the parsing) installed.

The page parses just fine for me on Python 2.7.9 with lxml 3.4.2 and libxml2 version 2.9.0.
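
If you want to check which versions you have, lxml exposes them at runtime; for example:

import lxml.etree

# lxml reports its own version and the libxml2 versions it was
# compiled against and loaded with, as tuples of integers.
print lxml.etree.LXML_VERSION             # e.g. (3, 4, 2, 0)
print lxml.etree.LIBXML_VERSION           # libxml2 loaded at runtime
print lxml.etree.LIBXML_COMPILED_VERSION  # libxml2 at compile time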

You can tell BeautifulSoup to use the standard-library parser with:

soup = BeautifulSoup(urlopen(url), 'html.parser')
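
A quick way to confirm that the full table now comes through (a sketch; it just looks for the cell that numbers the final kanji, 2136):

from bs4 import BeautifulSoup
from urllib2 import urlopen

url = "http://en.wikipedia.org/wiki/List_of_j%C5%8Dy%C5%8D_kanji"
soup = BeautifulSoup(urlopen(url), 'html.parser')

# Crude completeness check: the numbering column of the kanji table
# should now reach 2136 instead of stopping at 504.
print any(td.get_text(strip=True) == "2136" for td in soup.find_all("td"))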

See "Installing a parser" in the BeautifulSoup documentation for the implications of switching parsers.