Question

I'm trying to grab the string that comes immediately after the opening <td> tag. The following code works:

import re

webpage = urlopen(i).read()
soup = BeautifulSoup(webpage)
# Find text matching "<any character>.doc" and print the tag that contains it
for elem in soup('td', text=re.compile(r".\.doc")):
    print elem.parent

when the HTML looks like this:

<td>plan_49913.doc</td>

but not when the HTML looks like this:

<td>plan_49913.doc<br /> <font color="#990000">Document superseded by:  </font><a href="/plans/Jan_2012.html">January 2012</a></td>

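I've tried playing around with it but can't get it to work. Basically, I just want to grab 'plan_49913.doc' in either instance of the HTML.

Any advice would be greatly appreciated.

Thanks in advance.

~chrisk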

Answer 1

This works for me:

>>> html = '<td>plan_49913.doc<br /> <font color="#990000">Document superseded by: &#160;</font><a href="/plans/Jan_2012.html">January 2012</a></td>'
>>> soup = BeautifulSoup(html)
>>> soup.find(text=re.compile('.\.doc'))
u'plan_49913.doc'

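Am I missing something?

Also, note that according to the documentation:

"If you use text, then any values you give for name and the keyword arguments are ignored."

So you don't need to pass 'td', since it is ignored anyway. In other words, matching text under any other tag will be returned as well.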
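
If matches outside <td> cells need to be excluded, one option is to search the text nodes and then filter on each node's parent. This is only a minimal sketch, and it assumes Beautiful Soup 4 (the bs4 package) on Python 3 rather than the version used in the session above. The sample HTML, including the extra <p> element, and the variable names are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Illustrative HTML: one match inside a <td>, one inside a <p>.
html = (
    '<table><tr><td>plan_49913.doc<br /> '
    '<font color="#990000">Document superseded by: </font>'
    '<a href="/plans/Jan_2012.html">January 2012</a></td></tr></table>'
    '<p>notes.doc</p>'
)

soup = BeautifulSoup(html, "html.parser")

# Search text nodes only, then keep those whose direct parent is a <td>.
pattern = re.compile(r".\.doc")
names = [
    str(text)
    for text in soup.find_all(string=pattern)
    if text.parent.name == "td"
]
print(names)
```

The parent check does the work that the ignored 'td' argument was presumably meant to do.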

Answer 2

Just use the next attribute. It holds the next node, which here is the text node.

>>> html = '<td>plan_49913.doc<br /> <font color="#990000">Document superseded by: &#160;</font><a href="/plans/Jan_2012.html">January 2012</a></td>'
>>> bs = BeautifulSoup(html)
>>> texts = [ node.next for node in bs.findAll('td') if node.next.endswith('.doc') ]
>>> texts
[u'plan_49913.doc']

If you like, you can change the if clause to use a regular expression.
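
A regular-expression version of that if clause might look like the sketch below. It assumes Beautiful Soup 4 (bs4) on Python 3, where the attribute is spelled next_element, so it is not a drop-in for the session above. The second and third <td> cells are invented test data. The isinstance check is an added guard for cells whose first child is a tag rather than text.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Illustrative HTML: the cell from the question plus two made-up cells.
html = (
    '<td>plan_49913.doc<br /> <font color="#990000">Document superseded by: '
    '</font><a href="/plans/Jan_2012.html">January 2012</a></td>'
    '<td>plan_50000.doc</td>'
    '<td><a href="/x">link first</a>readme.doc</td>'
)

soup = BeautifulSoup(html, "html.parser")
doc_name = re.compile(r"\.doc$")

texts = []
for td in soup.find_all("td"):
    first = td.next_element  # node immediately after the opening <td>
    # Keep it only if it is a text node whose content ends in ".doc".
    if isinstance(first, NavigableString) and doc_name.search(first):
        texts.append(str(first))
print(texts)
```

The third cell is skipped because its first node is the <a> tag, not the file name.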

Beautiful Soup - grabbing the string after the first specified tag
