Question

A web page contains a product code that I need to retrieve. It sits in the following section of HTML:

<table...>
<tr>
 <td>
 <font size="2">Product Code#</font>
 <br>
 <font size="1">2342343</font>
 </td>

</tr>
</table>

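So I think the best approach is to first locate the HTML element whose text is 'Product Code#', and then reference the second font tag inside the same TD.

Thoughts?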

Answer 1

Assuming soup is a BeautifulSoup instance:

int(''.join(soup("font", size="1")[0](text=True)))

Or, if you need to get multiple product codes:

[int(''.join(font(text=True))) for font in soup("font", size="1")]
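
In the snippets above, calling `soup(...)` or a tag directly is Beautiful Soup shorthand for a find-all search. A minimal sketch of the same idea with the current `bs4` package might look like this; the inline `html` string copies the question's markup, and the `html.parser` backend is an assumption chosen so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (assumed to be representative).
html = """
<table>
<tr>
 <td>
 <font size="2">Product Code#</font>
 <br>
 <font size="1">2342343</font>
 </td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the text of every <font size="1"> tag and convert it to an integer.
product_codes = [
    int(font.get_text(strip=True))
    for font in soup.find_all("font", size="1")
]

print(product_codes)
```

Note that this relies on every `<font size="1">` tag on the page holding a product code, which may not be true for a full page.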

Answer 2

My strategy would be:

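Find the text nodes that match the string "Product Code#"
For each such node, get the parent element and find the parent's next sibling element
Insert the contents of the sibling element into a list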

Code:

from BeautifulSoup import BeautifulSoup


html = open("products.html").read()
soup = BeautifulSoup(html)

product_codes = [tag.parent.findNextSiblings('font')[0].contents[0]
                 for tag in 
                 soup.findAll(text='Product Code#')]
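
The code above targets the older BeautifulSoup 3 import. The same strategy can be sketched with the current `bs4` package roughly as follows; the inline `html` string stands in for `products.html`, and the `html.parser` backend is an assumption rather than a requirement.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; a real script would read the page instead.
html = """
<table>
<tr>
 <td>
 <font size="2">Product Code#</font>
 <br>
 <font size="1">2342343</font>
 </td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

product_codes = []
# Step 1: find text nodes that are exactly "Product Code#".
for label in soup.find_all(string="Product Code#"):
    # Step 2: go up to the enclosing <font> and look for the next <font> sibling.
    value_font = label.parent.find_next_sibling("font")
    # Step 3: store the sibling's text, skipping labels that have no value tag.
    if value_font is not None:
        product_codes.append(value_font.get_text(strip=True))

print(product_codes)
```

Anchoring on the label text rather than on `size="1"` keeps the lookup working even if other small-font text appears elsewhere on the page.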

Answer 3

You could use this regular expression (or something similar):

<td>\n\ <font\ size="2">Product\ Code\#</font>\n\ <br>\n\ <font\ size="1">(?<ProductCode>.+?)</font>\n\ </td>

Depending on your regex engine you may be able to drop some of the escapes... I was being cautious.
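
For illustration, here is a rough Python adaptation of this idea, matched against the question's markup. It is a sketch, not a translation of the exact pattern: Python's `re` module writes named groups as `(?P<name>...)`, and `\s*` is used here (an assumption about how much whitespace variation to tolerate) instead of escaped spaces and newlines.

```python
import re

# Sample markup copied from the question.
html = """
<table>
<tr>
 <td>
 <font size="2">Product Code#</font>
 <br>
 <font size="1">2342343</font>
 </td>
</tr>
</table>
"""

# Match the label tag, the <br>, and capture the contents of the following font tag.
pattern = re.compile(
    r'<font\s+size="2">\s*Product\s+Code#\s*</font>'
    r'\s*<br>\s*'
    r'<font\s+size="1">(?P<ProductCode>.+?)</font>',
    re.IGNORECASE,
)

match = pattern.search(html)
product_code = match.group("ProductCode") if match else None

print(product_code)
```

A pattern like this depends on the exact tag order and attribute spelling, so even small changes to the page's markup can make it stop matching.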

Answer 4

Don't use regular expressions to parse HTML. I would use the following XPath for this task:

//TABLE/TR/TD/FONT[@size='1']

Or, if the font size attribute is not guaranteed to be there and equal to 1:

//FONT[text()='Product Code#']/parent::*/FONT[2]
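
As a rough illustration, both expressions can be tried with Python's standard-library `xml.etree.ElementTree`, under some loudly stated assumptions: the snippet must be well-formed XML (so `<br>` is written as `<br/>` here), tag names are matched case-sensitively (lowercase in this sample), and ElementTree supports only a subset of XPath, so `text()` and `parent::*` are written as `.` and `..` (the `[.='text']` predicate needs Python 3.7 or later). For real-world HTML, a full XPath engine would likely be a better fit.

```python
import xml.etree.ElementTree as ET

# Assumption: the markup has been made well-formed (note the self-closed <br/>).
xhtml = """
<table>
<tr>
 <td>
 <font size="2">Product Code#</font>
 <br/>
 <font size="1">2342343</font>
 </td>
</tr>
</table>
"""

root = ET.fromstring(xhtml)

# First expression: select font elements by their size attribute.
by_size = [font.text for font in root.findall(".//font[@size='1']")]

# Second expression: find the label, step up to its parent,
# then take the parent's second font child.
by_label = [
    font.text
    for font in root.findall(".//font[.='Product Code#']/../font[2]")
]

print(by_size, by_label)
```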

Help retrieving a product code from HTML using Beautiful Soup
