Question

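I'm trying to parse text out of some HTML elements using the string argument as described here, but it fails. I've tried two different approaches, and I run into the same AttributeError every time.

How can I use the string argument in this case to fetch the text?

I've tried with: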

import re
from bs4 import BeautifulSoup

htmlelement = """
<caption>
    <span class="toggle open"></span>
    ASIC registration
</caption>
"""
soup = BeautifulSoup(htmlelement,"lxml")
try:
    item = soup.find("caption",string="ASIC registration").text
    #item = soup.find("caption",string=re.compile("ASIC registration",re.I)).text
except AttributeError:
    item = ""
print(item)

Expected output (only using the string argument):

ASIC registration

Answer 1

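The problem you're running into is that the string argument searches for strings instead of tags, as stated in the documentation you linked.

The syntax you are using: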

soup.find("caption",string="ASIC registration")

is for finding tags.

To find strings:

soup.find(string=re.compile('ASIC'))

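With the first one, you are saying: find a caption tag whose .string attribute matches your string. This caption tag has no .string (it is None, because the tag has more than one child), so nothing is returned.

The second one says: find a string that contains 'ASIC', so it returns the string.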
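The difference between the two calls can be sketched as follows. This is a minimal example, assuming the built-in "html.parser" in place of lxml so that it runs without an extra dependency; the variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup

html = """
<caption>
    <span class="toggle open"></span>
    ASIC registration
</caption>
"""
# Assumption: "html.parser" stands in for "lxml" to avoid an extra dependency.
soup = BeautifulSoup(html, "html.parser")

# Tag search: string= is compared against caption.string, which is None here
# because <caption> has several children, so no tag matches.
tag_match = soup.find("caption", string="ASIC registration")
print(tag_match)

# String search: no tag name is given, so the text nodes themselves are matched.
text_match = soup.find(string=re.compile("ASIC"))
print(repr(text_match.strip()))
```

Calling .text on the result of the first call is what raises the AttributeError, since find() returned None.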

Answer 2

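How can I use the string argument in this case to fetch the text?

You can't.

Note: I'm assuming you mean by making some change to the string argument in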

item = soup.find("caption",string="ASIC registration").text

As stated in the documentation:

If a tag has only one child, and that child is a NavigableString, the child is made available as .string:

import re
from bs4 import BeautifulSoup
htmlelement = """
<caption>
    <span class="toggle open"></span>
    ASIC registration
</caption>
"""
soup = BeautifulSoup(htmlelement,"lxml")
item = soup.find("caption")
print(item.string)

Output

None

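Here .string is None because the caption has more than one child.

If you are trying to get the parent (the caption tag in this case) that contains the text, you can do: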
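A small sketch of when .string is available and when it is not. The two-tag HTML snippet here is made up for illustration, and "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <b> has a single text child, <p> has a tag plus text.
soup = BeautifulSoup("<b>only text</b><p><i></i>text after a tag</p>", "html.parser")

single = soup.find("b")
mixed = soup.find("p")

print(single.string)        # the lone NavigableString child
print(mixed.string)         # None, because <p> has two children
print(list(mixed.strings))  # .strings still iterates over every nested text node
```

When .string is None, iterating over .strings or calling get_text() still reaches the text.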

item = soup.find(string=re.compile('ASIC registration')).parent

This will give (whitespace inside the tag omitted):

<caption><span class="toggle open"></span>ASIC registration</caption>

Of course, calling .text on the parent tag returns all of the text inside that tag, not only the string you matched.

item = soup.find(string=re.compile('ASIC')).parent.text

will give the output (plus the surrounding whitespace, which .strip() removes)

ASIC registration
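The parent lookup can be sketched end to end like this. It assumes "html.parser" instead of lxml, and uses get_text(strip=True) as one way to drop the surrounding whitespace.

```python
import re
from bs4 import BeautifulSoup

html = """
<caption>
    <span class="toggle open"></span>
    ASIC registration
</caption>
"""
# Assumption: "html.parser" stands in for "lxml" to avoid an extra dependency.
soup = BeautifulSoup(html, "html.parser")

# Match the text node first, then step up to the tag that contains it.
parent = soup.find(string=re.compile("ASIC registration")).parent
print(parent.name)

# strip=True trims each text node and skips the ones that are only whitespace.
print(parent.get_text(strip=True))
```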

Answer 3

It turns out that the string argument doesn't work when the tag has a child tag. The following code is clumsy, but it works:

# Reuses `re` and `soup` from the question's snippet.
real_item = ""
pattern = re.compile("ASIC registration", re.I)
for item in soup.find_all("caption"):
    # item.strings yields every text node nested inside the tag
    if any(pattern.search(str(s)) for s in item.strings):
        real_item = item
        break  # stop at the first caption that contains the text
print(real_item)
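A shorter variant of the same idea is to pass a function to find(), which Beautiful Soup accepts as a filter. This is only a sketch: the helper name caption_with_text is hypothetical, and "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup

html = """
<caption>
    <span class="toggle open"></span>
    ASIC registration
</caption>
"""
# Assumption: "html.parser" stands in for "lxml" to avoid an extra dependency.
soup = BeautifulSoup(html, "html.parser")

def caption_with_text(tag):
    # Hypothetical helper: accept <caption> tags whose combined text
    # contains the phrase, ignoring case.
    return tag.name == "caption" and "asic registration" in tag.get_text().lower()

item = soup.find(caption_with_text)
print(item.get_text(strip=True) if item else "")
```

Because the function inspects the tag's full text rather than .string, it still matches when the tag has child tags.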

The string argument behaves differently in my script
