Parsing an HTML page that contains & using Python

Time: 2014-05-17 04:43:26

Tags: python-2.7 urllib2 elementtree

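I am trying to parse an HTML page in Python using urllib2 and ElementTree, and I am running into trouble. The page contains "&" inside a quoted string, and ElementTree raises a ParseError on the lines that contain &.

Script: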

import urllib2

url = 'http://eciresults.nic.in/ConstituencywiseU011.htm'
req = urllib2.Request(url, headers={'Content-type': 'text/xml'})
r = urllib2.urlopen(req).read()

import xml.etree.ElementTree as ET
htmlpage=ET.fromstring(r)

This raises the following error in Python 2.7:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1282, in XML
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1624, in feed
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1488, in _raiseerror
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 676, column 73

The error corresponds to the following line:

<input type="hidden" id="HdnFldAndamanNicobar" value="1,Andaman & Nicobar Islands;" />
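The failure can be reproduced offline with just that one line, without fetching the page. A minimal sketch follows; the helper name parses_as_xml is made up for illustration, and the escaped variant is only there to show what the XML parser would accept:

```python
import xml.etree.ElementTree as ET

# The offending line from the page, with a bare "&" inside the attribute value.
line = ('<input type="hidden" id="HdnFldAndamanNicobar" '
        'value="1,Andaman & Nicobar Islands;" />')

# The same line with the ampersand written as an XML entity.
escaped = line.replace(' & ', ' &amp; ')


def parses_as_xml(text):
    """Return True if ElementTree accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(parses_as_xml(line))      # the bare "&" is rejected
print(parses_as_xml(escaped))   # the escaped form is accepted
```

So urllib2 is returning the bytes exactly as the server sent them; it is the XML parser that rejects the unescaped ampersand.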

It looks like, when the HTML page is read, the & symbol is not converted to &amp; in the variable r.

I tried the same page in R, and htmlTreeParse correctly converts "&" to &amp; when parsing.

Please let me know if I am missing anything in urllib2.

Edit: I replaced "&" with &amp;, but line 904 contains a < sign inside JavaScript, which throws the same error. There should be a better option than replacing characters.

LINE:904    for (i = 0; i < strac.length - 1; i++) {
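The same kind of offline check shows why patching characters one at a time does not scale. In the sketch below, the enclosing script element is an assumption (the page's actual markup around line 904 is not shown here), and the loop body is left empty:

```python
import xml.etree.ElementTree as ET

# JavaScript from line 904, with the loop closed so the snippet is complete.
js = 'for (i = 0; i < strac.length - 1; i++) { }'

# Assumed wrapper: the script as it would typically appear in an HTML page.
as_html = '<script>' + js + '</script>'

# What an XML parser would need instead: the code wrapped in a CDATA section.
as_xml = '<script><![CDATA[' + js + ']]></script>'

try:
    ET.fromstring(as_html)
    error = None
except ET.ParseError as exc:
    error = exc  # the "<" in "i < strac.length" is read as the start of a tag

print(error is not None)
print(ET.fromstring(as_xml).text == js)
```

HTML treats script content as raw text, while XML does not, so an ordinary HTML page can fail an XML parser in many places besides the ampersands.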

1 answer:

Answer 0 (score: 5)

First of all, xml.etree.ElementTree is an XML parser. It cannot handle HTML entities out of the box. A bare & is illegal inside XML, which is why it fails.

Get yourself a real, specialized HTML parser such as BeautifulSoup:
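To illustrate the difference between an XML parser and an HTML parser without installing anything, here is a small sketch that uses the standard library's HTMLParser. This is a swapped-in alternative for demonstration, not the approach recommended in this answer; the class name HiddenInputCollector and the sample markup are made up, and the sample only imitates the two problem lines from the question:

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2.7


class HiddenInputCollector(HTMLParser):
    """Collect id -> value for every <input type="hidden"> element."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.values = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('type') == 'hidden':
            self.values[attrs.get('id')] = attrs.get('value')


# Hypothetical sample containing both a bare "&" and a "<" inside a script.
sample = '''<html><body>
<input type="hidden" id="HdnFldAndamanNicobar" value="1,Andaman & Nicobar Islands;" />
<script>for (i = 0; i < strac.length - 1; i++) { }</script>
</body></html>'''

collector = HiddenInputCollector()
collector.feed(sample)
collector.close()
print(collector.values)
```

An HTML parser accepts both constructs because they are ordinary HTML, so no character replacement is needed before parsing.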

>>> from urllib2 import urlopen
>>> from bs4 import BeautifulSoup
>>> url = 'http://eciresults.nic.in/ConstituencywiseU011.htm'
>>> soup = BeautifulSoup(urlopen(url))
>>> soup.find('td').text.strip()
u'ELECTION COMMISSION OF INDIA'

See also: