Question

I'm trying to extract the names that start with a hash mark from a div.

<div class="h_names">#jason, #michael, #sam, etc...</div>

So my result would be a list of jason, michael, sam, etc.

I'm not sure how to do this with BeautifulSoup.

import bs4

soup = bs4.BeautifulSoup(html)
div = soup.find('div', {'class' : 'h_names'})

This finds the div, but I need a regular expression to extract the names.

Answer 1

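This doesn't use a regular expression, but I don't think you need one, or need to import anything new, because BeautifulSoup gives you built-in ways to extract the text from HTML.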

If the div is:

'<div class="h_names">#jason, #michael, #sam</div>' # without the etc.. bit

then:

div = soup.find('div', {'class' : 'h_names'})
names = [str(name.strip()[1:]) for name in div.text.split(',')]

Output:

>>> print names
['jason', 'michael', 'sam']

names is created with a list comprehension.

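The string conversion in the list comprehension (using str()) is there because, under Python 2, div.text returns unicode strings (e.g. u'jason'). In Python 3, div.text is already a str, so the conversion is harmless but unnecessary.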

The [1:] string slice cuts off the first character of each string (in this case '#').

The strip string method (str.strip()) simply removes any leading or trailing whitespace, including newlines (\n).
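The split, strip and slice steps described above can be traced on a plain string. This is only a minimal Python 3 sketch: text is a stand-in for whatever div.text returns, and the helper name extract_names is hypothetical, not something from BeautifulSoup. It also skips tokens that do not start with '#', which is one possible way to cope with the trailing "etc..." in the question.

```python
def extract_names(text):
    """Return the '#'-prefixed names found in a comma-separated string."""
    names = []
    for token in text.split(','):      # '#jason', ' #michael', ' #sam', ' etc...'
        token = token.strip()          # drop surrounding spaces and newlines
        if token.startswith('#'):      # ignore tokens such as 'etc...'
            names.append(token[1:])    # slice off the leading '#'
    return names


# Assumed stand-in for div.text, copied from the markup in the question
text = "#jason, #michael, #sam, etc..."
print(extract_names(text))  # ['jason', 'michael', 'sam']
```

With the real page you would pass div.text instead of the hard-coded string.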

Answer 2

You can use re.findall() to match the pattern inside the div element.

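import bs4
import re

# html is assumed to hold the page source containing the div from the question
soup = bs4.BeautifulSoup(html, 'html.parser')  # name the parser explicitly
div = soup.find('div', {'class': 'h_names'})
# Capture the run of letters that follows each '#'
names = re.findall(r'#([a-zA-Z]+)', str(div.text))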

Output:

['jason', 'michael', 'sam']

This returns the list of names found in the text that bs4 extracted.
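One thing to watch with the [a-zA-Z]+ pattern is that it stops at the first character that is not a letter. The sketch below compares it with \w+ on a plain string. The sample text, including the name sam_2, is made up for illustration and stands in for div.text.

```python
import re

# Assumed sample text; '#sam_2' is a made-up name containing non-letters
text = "#jason, #michael, #sam_2, etc..."

# Letters only: the match for '#sam_2' ends before the underscore
letters_only = re.findall(r'#([a-zA-Z]+)', text)

# Word characters: letters, digits and underscore are all kept
word_chars = re.findall(r'#(\w+)', text)

print(letters_only)  # ['jason', 'michael', 'sam']
print(word_chars)    # ['jason', 'michael', 'sam_2']
```

Which pattern is right depends on what characters your names can contain. Either way, the trailing "etc..." is ignored because it has no '#' in front of it.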
