使用BeautifulSoup在单引号属性值中解析未转义的撇号

时间:2015-03-16 09:54:51

标签: python html attributes escaping beautifulsoup

从网页上,我想获取所有链接和标题字符串。我使用BeautifulSoup 4进行​​刮擦。网页上的链接如下所示:

<a href='http://www.example1.com' title='A small secret for better estimates #4/16/2014 8:10:30 AM'> Example 1 </a>
<a href='http://www.example2.com' title='Don't make me think #4/9/2014 4:36:07 AM'> Example 2</a>

刮削解决方案效果很好:

#Import
import codecs
import urllib   
from bs4 import BeautifulSoup

#Parse
url = "http://www.website-to-scrape.com"
sock = urllib.urlopen(url)
htmlsrc = sock.read()
sock.close()
html = BeautifulSoup(htmlsrc)
html.__str__()
alllinks = html.find_all('a', href=True, title = True)

for tags in range(len(alllinks)-1):
        link = alllinks[tags]['href'].encode('utf-8') 
        title = alllinks[tags]['title'].encode('utf-8')
print title

问题: BeautifulSoup不知道如何正确转义字符串中的单引号,即'

因此,对于example2,它只会输出Don

A small secret for better estimates #4/16/2014 8:10:30 AM
Don

1 个答案:

答案 0 :(得分:2)

问题不是BeautifulSoup,而是你的HTML,这是无效的。根据{{​​3}},单引号属性值具有以下语法:

  

属性名称,后跟零个或多个空格字符,后跟一个U + 003D EQUALS SIGN字符,后跟零个或多个空格字符,后跟一个U + 0027 APOSTROPHE字符('),后跟属性值,除了上面给出的属性值要求外,不得包含任何文字U + 0027 APOSTROPHE字符('),最后是第二个单个U + 0027 APOSTROPHE字符( “)。

虽然BeautifulSoup的所有解析器HTML specification都会尝试来解析你问题中的无效HTML,但是没有一个会做你想做的事情:

>>> BeautifulSoup(src, "html.parser")
<a href="http://www.example1.com" title="A small secret for better estimates #4/16/2014 8:10:30 AM"> Example 1 </a>
<a #4="" 2014="" 4:36:07="" 9="" am'="" href="http://www.example2.com" make="" me="" t="" think="" title="Don"> Example 2</a>
>>> BeautifulSoup(src, "lxml")
<html><body><a href="http://www.example1.com" title="A small secret for better estimates #4/16/2014 8:10:30 AM"> Example 1 </a>
<a am="" href="http://www.example2.com" make="" me="" t="" think="" title="Don"> Example 2</a>
</body></html>
>>> BeautifulSoup(src, "html5lib")
<html><head></head><body><a href="http://www.example1.com" title="A small secret for better estimates #4/16/2014 8:10:30 AM"> Example 1 </a>
<a #4="" 2014="" 4:36:07="" 9="" am'="" href="http://www.example2.com" make="" me="" t="" think="" title="Don"> Example 2</a>
</body></html>

任何现代浏览器都不会:

<强>火狐
Firefox developer tools

<强>铬
Chrome developer tools

IE 11
IE11 developer tools

如果要在单引号属性值中表示撇号,则需要使用&apos;字符实体引用:

>>> BeautifulSoup("""
... <a href='http://www.example1.com' title='A small secret for better estimates #4/16/2014 8:10:30 AM'> Example 1 </a>
... <a href='http://www.example2.com' title='Don&apos;t make me think #4/9/2014 4:36:07 AM'> Example 2</a>
... """)
<html><body><a href="http://www.example1.com" title="A small secret for better estimates #4/16/2014 8:10:30 AM"> Example 1 </a>
<a href="http://www.example2.com" title="Don't make me think #4/9/2014 4:36:07 AM"> Example 2</a>
</body></html>

或者,您可以使用双引号属性值:

>>> BeautifulSoup("""
... <a href='http://www.example1.com' title='A small secret for better estimates #4/16/2014 8:10:30 AM'> Example 1 </a>
... <a href='http://www.example2.com' title="Don't make me think #4/9/2014 4:36:07 AM"> Example 2</a>
... """)
<html><body><a href="http://www.example1.com" title="A small secret for better estimates #4/16/2014 8:10:30 AM"> Example 1 </a>
<a href="http://www.example2.com" title="Don't make me think #4/9/2014 4:36:07 AM"> Example 2</a>
</body></html>