BeautifulSoup find_all("span", text=re.compile("T")) returns an empty list

Time: 2015-03-25 07:23:53

Tags: python beautifulsoup

The HTML file can be downloaded from here.

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(open(r"test.html"), from_encoding="ascii")
In [43]: soup.find_all("span")
Out[43]:
    [<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:648px; height:783px;"></span>,
     <span style="font-family: LJOGFN+HelveticaNeueLTStd-Bd; font-size:7px">S
     <br/></span>,
     <span style="font-family: LJOGFN+HelveticaNeueLTStd-Bd; font-size:7px">T
     <br/></span>,
     <span style="font-family: LJOGFN+HelveticaNeueLTStd-Bd; font-size:8px">N
     <br/></span>,
     <span style="font-family: LJOGFN+HelveticaNeueLTStd-Bd; font-size:7px">E
     <br/></span>,
     <span style="font-family: LJOGFN+HelveticaNeueLTStd-Bd; font-size:7px">T
     <br/></span>,
     <span style="font-family: LJOGFN+HelveticaNeueLTStd-Bd; font-size:8px">N
     <br/></span>]

In [44]: soup.find_all("span", text=re.compile("T"))
Out[44]: []

Why does it return an empty list? Is this related to the encoding?

Update: the following code works:

In [87]: 
def aa(tag):
    return tag.name == "span" and re.match("T", tag.text)
In [88]: soup.find_all(aa)[0]

What is going on here?

1 answer:

Answer 0 (score: 1)

According to the documentation (http://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-text-argument), your code should work. You should file a bug report.

Edit: it looks like this problem is caused by the <br> tags inside the <span> elements. This is definitely a bug.
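A possible explanation for why the <br> tags matter (offered as a reading of the documented behavior, not something confirmed in this thread): when a text filter is combined with a tag name, Beautiful Soup compares the pattern against each tag's .string, and .string is documented to be None when a tag has more than one child. A span containing text plus <br/> has two children, so there is nothing for the pattern to match. The sketch below uses a made-up HTML snippet (not the asker's test.html) and the built-in "html.parser" to show the difference between .string and get_text():

```python
from bs4 import BeautifulSoup

# Hypothetical minimal HTML that mirrors the spans in the question;
# the third span has no <br/> so it has exactly one child.
html = "<span>S<br/></span><span>T<br/></span><span>plain T</span>"
soup = BeautifulSoup(html, "html.parser")

spans = soup.find_all("span")

# .string is None for a tag with more than one child (text + <br/>).
strings = [span.string for span in spans]

# get_text() concatenates all descendant text regardless of child count.
texts = [span.get_text() for span in spans]

print(strings)  # [None, None, 'plain T']
print(texts)    # ['S', 'T', 'plain T']
```

If this reading is right, the empty result has nothing to do with the encoding, and matching on the tag's full text sidesteps it.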

To work around this, use a lambda so that you don't need to define a separate function:

soup.find_all(lambda tag: tag.name == "span" and re.match("T", tag.text))
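For reference, here is a self-contained sketch of the same workaround that can be run without the original file. The HTML string and the helper name span_contains are assumptions made for illustration. It uses re.search rather than re.match, so the pattern may appear anywhere in the span's text; re.match, as in the one-liner above, only matches at the start of the text.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for test.html: spans with text followed by <br/>.
html = (
    "<span>S<br/></span><span>T<br/></span>"
    "<span>N<br/></span><span>T<br/></span>"
)
soup = BeautifulSoup(html, "html.parser")

pattern = re.compile("T")

def span_contains(tag):
    # Test the concatenated text of the tag instead of relying on .string.
    return tag.name == "span" and pattern.search(tag.get_text()) is not None

matches = soup.find_all(span_contains)
print(len(matches))  # 2
```

Passing a function to find_all lets the filter inspect any property of the tag, which is why it keeps working when the tag has several children.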